Paolo Bonzini [Tue, 8 May 2012 14:51:58 +0000 (16:51 +0200)]
stream: do not copy unallocated sectors from the base
Unallocated sectors should really never be accessed by the guest,
so there's no need to copy them during the streaming process.
If they are read by the guest during streaming, guest-initiated
copy-on-read will copy them (we're in the base == NULL case, which
enables copy on read). If they are read after we disconnect the
image from the base, they will read as zeroes anyway.
Paolo Bonzini [Tue, 8 May 2012 14:51:57 +0000 (16:51 +0200)]
stream: fix ratelimiting corner case
This fixes inability to make progress in streaming if the quota is set
to less than the amount of data that an I/O operation has to write.
In this case, limit->dispatched + n will always be above the quota and,
due to the "goto retry" to recheck cancellation and allocation, streaming
will livelock.
This can be reproduced with "block_job_set_speed ide0-hd0 1b". Of course,
with this patch the requested limit will not be obeyed. That could be
done with another patch that caps is_allocated's n argument by the slice
quota.
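The corner case above can be sketched as follows. This is a minimal illustration, not the actual QEMU code: the RateLimit type, its field names and both helper functions are hypothetical. The idea is that the first request of a slice is always let through, so a request larger than the quota can still make progress.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rate-limit state; names are illustrative only. */
typedef struct {
    uint64_t slice_quota;  /* units allowed per time slice */
    uint64_t dispatched;   /* units already dispatched in this slice */
} RateLimit;

/* Returns 1 if a request of n units may proceed now, 0 if the caller
 * has to wait for the next slice. */
static int ratelimit_try_dispatch(RateLimit *limit, uint64_t n)
{
    assert(limit != NULL);
    /* Without the "dispatched == 0" test, n > slice_quota would be
     * rejected in every slice and the caller would never progress. */
    if (limit->dispatched == 0 ||
        limit->dispatched + n <= limit->slice_quota) {
        limit->dispatched += n;
        return 1;
    }
    return 0;
}

/* Called when a new time slice begins. */
static void ratelimit_new_slice(RateLimit *limit)
{
    assert(limit != NULL);
    limit->dispatched = 0;
}
```

As the message says, a scheme like this lets an oversized request exceed the requested limit once per slice.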
Paolo Bonzini [Tue, 8 May 2012 14:51:55 +0000 (16:51 +0200)]
stream: pass new base image format to bdrv_change_backing_file
When an image is modified to point to the new backing file, the backing
file format is set to NULL, which means auto-probe. This is wrong; in
fact, it is a small security problem.
Paolo Bonzini [Tue, 8 May 2012 14:51:51 +0000 (16:51 +0200)]
qemu-io: correctly print non-integer values as decimals
qemu-io's cvtstr function sometimes will incorrectly omit the
decimal part of the number, and sometimes will incorrectly include
it. This patch fixes both. The former is more serious, and can
be seen in the patches to 027.out and 033.out.
The changes to all other files were scripted with sed, so there were
no "surprises" beyond 027.out and 033.out.
Paolo Bonzini [Tue, 8 May 2012 14:51:50 +0000 (16:51 +0200)]
qemu-img: make "info" backing file output correct and easier to use
qemu-img info should use the same logic as qemu when printing the
backing file path, or debugging becomes quite tricky. We can also
simplify the output in case the backing file has an absolute path
or a protocol.
Paolo Bonzini [Tue, 8 May 2012 14:51:48 +0000 (16:51 +0200)]
block: protect path_has_protocol from filenames with colons
path_has_protocol will erroneously return "true" if the colon is part
of a filename. These names are common with stable device names produced
by udev. We cannot fully protect against this in case the filename
does not have a path component (e.g. if the current directory is
/dev/disk/by-path), but in the common case there will be a slash before
and path_has_protocol can easily detect that and return false.
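One way to express the check described above, as a sketch only: look at whichever of ':' or '/' comes first. The function name has_protocol_prefix is hypothetical, and the Windows drive-letter handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the name looks like "protocol:...", 0 otherwise. */
static int has_protocol_prefix(const char *path)
{
    const char *p;

    assert(path != NULL);
    /* Skip to the first ':' or '/', or to the end of the string. */
    p = path + strcspn(path, ":/");
    /* A colon counts only if no slash precedes it. */
    return *p == ':';
}
```

A bare file name that contains a colon and has no path component is still reported as having a protocol, which is the case the message says cannot be fully protected against.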
Paolo Bonzini [Tue, 8 May 2012 14:51:47 +0000 (16:51 +0200)]
block: simplify path_is_absolute
On Windows, all the logic is already in is_windows_drive and
is_windows_drive_prefix. On POSIX, there is no need to look
out for colons.
The win32 code changes the behaviour in some cases. Consider something
like "d:foo.img": the old code would treat it as a relative path, the
new one as absolute. It can be seen as absolute, because to go from
c:/program files/blah to d:foo.img you cannot say c:/program
files/blah/d:foo.img; you have to say d:foo.img. But you could also
say it's relative because (I think, at least it was like that in DOS
15 years ago) d:foo.img is relative to the current path of drive D.
Considering how path_is_absolute is used by path_combine, I think it's
better to treat it as absolute.
Paolo Bonzini [Tue, 8 May 2012 14:51:46 +0000 (16:51 +0200)]
block: wait for job callback in block_job_cancel_sync
The limitation on not having I/O after cancellation cannot really be
kept. Even streaming has a very small race window where you could
cancel a job and have it report completion. If this window is hit,
bdrv_change_backing_file() will yield and possibly cause accesses to
dangling pointers etc.
So, let's just assume that we cannot know exactly what will happen
after the coroutine has set busy to false. We can set a very lax
condition:
- if we cancel the job, the coroutine won't set it to false again
(and hence will not call co_sleep_ns again).
- block_job_cancel_sync will wait for the coroutine to exit, which
pretty much ensures no race.
Instead, we track the coroutine that executes the job and put very
strict conditions on what to do while it is quiescent (busy = false).
First of all, the coroutine must never set busy = false while the job
has been cancelled. Second, the coroutine can be reentered arbitrarily
while it is quiescent, so you cannot really do anything but co_sleep_ns at
that time. This condition is obeyed by the block_job_sleep_ns function.
Paolo Bonzini [Tue, 8 May 2012 14:51:44 +0000 (16:51 +0200)]
block: fully delete bs->file when closing
We are reusing bs->file across close/open, which may not cause any
known bugs but is a recipe for trouble. Prefer bdrv_delete, and
enjoy the new invariant in the implementation of bdrv_delete.
Paolo Bonzini [Tue, 8 May 2012 14:51:43 +0000 (16:51 +0200)]
block: do not reuse the backing file across bdrv_close/bdrv_open
This is another bug caused by not doing a full cleanup of the BDS
across close/open. This was found with mirroring by Shaolong Hu,
but it can probably be reproduced also with eject or change.
Paolo Bonzini [Tue, 8 May 2012 14:51:42 +0000 (16:51 +0200)]
block: another bdrv_append fix
bdrv_append must also copy open_flags to the top, because the snapshot
has BDRV_O_NO_BACKING set. This causes interesting results if you
later use drive-reopen (not upstream) to reopen the image, and lose
the backing file in the process.
Paolo Bonzini [Tue, 8 May 2012 14:51:41 +0000 (16:51 +0200)]
block: fix snapshot on QED
QED's opaque data includes a pointer back to the BlockDriverState.
This breaks when bdrv_append shuffles data between bs_new and bs_top.
To avoid this, add a "rebind" function that tells the driver about
the new relationship between the BlockDriverState and its opaque.
The patch also adds rebind to VVFAT for completeness, even though
it is not used with live snapshots.
Paolo Bonzini [Thu, 12 Apr 2012 12:01:05 +0000 (14:01 +0200)]
qemu-iotests: strip spaces from qemu-img/qemu-io/qemu command lines
A trailing space is left when qemu-img has no arguments, for example if
-nocache is not used. This becomes an empty argument after split()
and causes qemu-io to fail.
Paolo Bonzini [Thu, 12 Apr 2012 12:01:04 +0000 (14:01 +0200)]
block: fix allocation size for dirty bitmap
Also reuse elsewhere the new constant for sizeof(unsigned long) * 8.
The dirty bitmap is allocated in bits but declared as unsigned long.
Thus, its memory block is accessed beyond its end unless the image
is a multiple of 64 chunks (i.e. a multiple of 64 MB).
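The size calculation can be sketched like this: round the number of chunks (one bit each) up to a whole number of unsigned longs before converting to bytes. The names BITS_PER_LONG and dirty_bitmap_bytes are assumptions made for the example, not necessarily those used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits held by one bitmap word, as in sizeof(unsigned long) * 8. */
#define BITS_PER_LONG (sizeof(unsigned long) * 8)

/* Bytes needed for a bitmap with one bit per chunk, stored as an
 * array of unsigned long. */
static size_t dirty_bitmap_bytes(uint64_t chunks)
{
    uint64_t longs;

    /* Guard against overflow in the round-up below. */
    assert(chunks <= UINT64_MAX - BITS_PER_LONG);
    longs = (chunks + BITS_PER_LONG - 1) / BITS_PER_LONG;
    return (size_t)(longs * sizeof(unsigned long));
}
```

With a 64-bit unsigned long, 65 chunks need two words (16 bytes); allocating only 65 bits' worth of bytes would leave the second word partly out of bounds.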
Paolo Bonzini [Thu, 12 Apr 2012 12:01:03 +0000 (14:01 +0200)]
block: open backing file as read-only when probing for size
bdrv_img_create will temporarily open the backing file to probe its size.
However, this could be done with a read-write open if the wrong flags are
passed to bdrv_img_create. Since there is really no documentation on
what flags can be passed, assume that bdrv_img_create receives the flags
with which the new image will be opened; sanitize them when opening
the backing file.
Paolo Bonzini [Thu, 12 Apr 2012 12:01:02 +0000 (14:01 +0200)]
block: update in-memory backing file and format
These are needed to print "info block" output correctly. QCOW2 does this
because it needs it to write the header, but QED does not, and common code
is the right place to do it.
block: add the support to drain throttled requests
Signed-off-by: Zhi Yong Wu <[email protected]>
[ Iterate until all block devices have processed all requests,
add comments. - Paolo ]
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Kevin Wolf <[email protected]>
Andreas Färber [Wed, 9 May 2012 17:26:56 +0000 (19:26 +0200)]
tcg/ppc: Do not overwrite lower address word on Darwin and AIX
For targets where TARGET_LONG_BITS != 32, i.e. 64-bit guests,
addr_reg is moved to r4. For hosts without TCG_TARGET_CALL_ALIGN_ARGS
either data_reg2 or data_reg or a masked version thereof would overwrite
r4. Place it in r5 instead, matching TCG_TARGET_CALL_ALIGN_ARGS hosts.
This fixes immediate crashes of 64-bit guests observed on Darwin/ppc but
not on Darwin/ppc64.
Anthony Liguori [Tue, 8 May 2012 18:07:41 +0000 (13:07 -0500)]
Merge remote-tracking branch 'qmp/queue/qmp' into staging
* qmp/queue/qmp:
hmp: fix bad value conversion for M type
hmp: expr_unary(): check for overflow in strtoul()/strtoull()
vl: drop is_suspended variable
runstate: introduce suspended state
qapi-schema.json: fix RunState enums alphabetical order
wakeup on migration
QEMU enters the suspended state when the guest suspends to RAM (S3).
This is important so that HMP users and QMP clients can know that
the guest is suspended. QMP also has an event for this, but events
are not reliable and are limited (i.e. a client can connect to QEMU
after the event has been emitted).
Having a different state for S3 brings a new issue, though. Every
device that doesn't run when the VM is stopped but wants to run
when the VM is suspended has to check for RUN_STATE_SUSPENDED
explicitly. This is the case for the keyboard and mouse devices,
for example.
Gerd Hoffmann [Wed, 7 Mar 2012 07:00:26 +0000 (08:00 +0100)]
wakeup on migration
Wake up the guest when the live part of the migration is finished.
This avoids being in suspended state on migration, so we don't
have to save the is_suspended bit.
Peter Maydell [Thu, 3 May 2012 18:32:15 +0000 (19:32 +0100)]
user-exec.c: Don't assert on segfaults for non-valid addresses
h2g() will assert if passed an address that's not a valid guest address,
so handle_cpu_signal() needs to check before passing "data address
which caused a segfault" to it, since for a misbehaving guest
that could be anything. If the address isn't a valid guest address
then we can simply skip the attempt to unprotect a guest page
which was made read-only to catch self-modifying code.
This assertion probably fires more readily now than it used to
do because of recent changes to default to reserving guest address
space.
Anthony Liguori [Tue, 8 May 2012 14:38:41 +0000 (09:38 -0500)]
Merge remote-tracking branch 'kwolf/for-anthony' into staging
* kwolf/for-anthony:
fdc: simplify media change handling
qcow2: lock on prealloc
block: make bdrv_create adopt coroutine
qcow2: Limit COW to where it's needed
sheepdog: switch to writethrough mode if cluster doesn't support flush
Anthony Liguori [Tue, 8 May 2012 14:37:12 +0000 (09:37 -0500)]
Merge remote-tracking branch 'bonzini/scsi-next' into staging
* bonzini/scsi-next:
scsi: Add assertion for use-after-free errors
scsi: remove useless debug messages
scsi: set VALID bit to 0 in fixed format sense data
scsi: do not require a minimum allocation length for REQUEST SENSE
scsi: do not require a minimum allocation length for INQUIRY
scsi: parse 16-byte tape CDBs
scsi: do not report bogus overruns for commands in the 0x00-0x1F range
scsi-disk: add dpofua property
scsi: change "removable" field to host many features
scsi: Specify the xfer direction for UNMAP and ATA_PASSTHROUGH commands
scsi: fix WRITE SAME transfer length and direction
scsi: fix refcounting for reads
scsi: prevent data transfer overflow
ISCSI: Add support for thin-provisioning via discard/UNMAP and bigger LUNs
Avi Kivity [Mon, 7 May 2012 12:00:45 +0000 (15:00 +0300)]
rtl8139: fix regression in TxStatus/TxAddr read
Commit afe0a595356192 added byte reads for TxStatus/TxAddr, but
broke 32-bit reads; the mask generation
(1 << (8 * size)) - 1
is undefined in C for size >= sizeof(int), and in fact evaluates to 0
on x86.
Fix by using a larger type.
Fixes (at least) Fedora 9 i386 with -machine kernel_irqchip=on. I
didn't see it with the qemu APIC implementation; may be due to timing
or (more likely) a tester error.
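A minimal sketch of the mask generation with a larger type. The helper name access_mask is hypothetical; the point is only that the shift is done in 64 bits, so a 4-byte access no longer shifts by the full width of int.

```c
#include <assert.h>
#include <stdint.h>

/* Mask covering the low "size" bytes of a 32-bit register. */
static uint32_t access_mask(unsigned size)
{
    assert(size >= 1 && size <= 4);
    /* 1ULL is at least 64 bits wide, so shifting by 32 is well defined. */
    return (uint32_t)((1ULL << (8 * size)) - 1);
}
```

For size 1 this gives 0xff, and for size 4 it gives 0xffffffff instead of the 0 seen with the int-sized shift.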
This also (partly) fixes IBM OS/2 Warp 4.0 floppy installation, where
not all floppies have the same format (2x80x18 for the first ones,
2x80x23 for the next ones).
preallocate() will be locked. This is required because
qcow2_alloc_cluster_link_l2() assumes that it runs under a lock that it
can drop while COW is being performed.
Kevin Wolf [Thu, 26 Apr 2012 17:41:22 +0000 (19:41 +0200)]
qcow2: Limit COW to where it's needed
This fixes a regression introduced in commit 250196f1. The bug leads to
data corruption, found during an Autotest run with a Fedora 8 guest.
Consider a write request whose first part is covered by an already
allocated cluster, but additional clusters need to be newly allocated.
When counting the number of clusters to allocate, the qcow2 code would
decide to do COW for all remaining clusters of the write request, even
if some of them are already allocated.
If during this COW operation another write request is issued that touches
the same cluster, it will still refer to the old cluster. When the COW
completes, the first request will update the L2 table and the second
write request will be lost. Note that the requests need not overlap, it's
enough for them to touch the same cluster.
This patch ensures that only clusters that really require COW are
considered for allocation. In this case any other request writing to the
same cluster will be an allocating write and gets serialised.
Hans de Goede [Mon, 7 May 2012 07:24:37 +0000 (09:24 +0200)]
hw/ac97: Mask out unused bits of volume controls
The Linux ac97 driver does a number of register read/write tests to
see how much resolution a volume control actually has.
This patch takes this into account by masking out any bits written to
a volume control reg which should not be there according to the spec.
After this the Linux ac97 driver correctly uses a range of 0 - 0x1f for
the PCM out volume, as stated in the spec, and we can fix the FIXME
in update_combined_volume_out().
This patch was also tested with a Windows XP guest without any issues.
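The masking can be sketched as below. The register layout is an assumption taken from the AC'97 description of a stereo volume register (mute in bit 15, 5-bit left volume in bits 12-8, 5-bit right volume in bits 4-0); the macro and function names are hypothetical and not taken from hw/ac97.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bit 15 mute, bits 12-8 left, bits 4-0 right. */
#define VOL_STEREO_5BIT_MASK 0x9f1fu

/* Drop any bits a guest writes outside the defined fields. */
static uint16_t mask_volume_write(uint16_t val)
{
    uint16_t masked = (uint16_t)(val & VOL_STEREO_5BIT_MASK);

    /* Each channel now fits in the 0 - 0x1f range. */
    assert(((masked >> 8) & 0x7f) <= 0x1f);
    assert((masked & 0xff) <= 0x1f);
    return masked;
}
```

A driver probing resolution by writing all ones would read back 0x9f1f and conclude that each channel has 5 bits.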
We are (correctly) using AC97_Record_Gain_Mute and not AC97_Line_In_Volume_Mute
for recording volume, but various places in hw/ac97 were still assuming that
we are using AC97_Line_In_Volume_Mute for record volume control; this patch
fixes this.
Hans de Goede [Mon, 7 May 2012 07:24:35 +0000 (09:24 +0200)]
hw/ac97: Make a bunch of mixer registers read only
The Linux ac97 driver tries to see if optional things like video input
volume control are available in 2 ways:
1) See if the mute bit is set after reset, if it is no further tests are done
2) If the mute bit is not set it does a write/read test of the mute bit
This patch changes our ac97 to conform to what the Linux driver expects: it
initializes registers for things which we don't emulate to 0 (so the mute bit
is not set) and makes them read only.
This causes Linux to no longer show the following (functionless)
controls in alsamixer:
Master Mono vol + mute
3d Control toggle
PCM out pre / post 3d select
Surround toggle
CD vol + mute
Mic vol + mute
Mic boost toggle
Mic mic1 / mic2 select
Video vol + mute
Phone vol + mute
Beep mono vol + mute
Aux vol + mute
Mono "output mic" / "mix" select
Sigmatel 4 speaker stereo toggle
Sigmatel ADC 6Db att toggle
Sigmatel DAC 6Db att toggle
This patch was also tested with a Windows XP guest and there it also makes
a number of functionless mixer controls go away.
Stefan Weil [Fri, 4 May 2012 06:51:16 +0000 (08:51 +0200)]
scsi: Add assertion for use-after-free errors
The QEMU emulation which is currently used with Raspberry PI images
(qemu-system-arm -M versatilepb ...) accesses memory which was freed.
Valgrind output (extract):
==17857== Invalid write of size 4
==17857== at 0x24EB06: scsi_req_unref (scsi-bus.c:1273)
==17857== by 0x24FFAE: scsi_read_complete (scsi-disk.c:277)
==17857== by 0x152ACC: bdrv_co_em_bh (block.c:3363)
==17857== by 0x13D49C: qemu_bh_poll (async.c:71)
==17857== by 0x211A8C: main_loop_wait (main-loop.c:503)
==17857== by 0x207954: main_loop (vl.c:1555)
==17857== by 0x20E9C9: main (vl.c:3653)
==17857== Address 0x1c54383c is 12 bytes inside a block of size 260 free'd
==17857== at 0x4824B3A: free (vg_replace_malloc.c:366)
==17857== by 0x20ADFA: free_and_trace (vl.c:2250)
==17857== by 0x4899FC5: g_free (in /lib/libglib-2.0.so.0.2400.1)
==17857== by 0x24EB3B: scsi_req_unref (scsi-bus.c:1277)
==17857== by 0x24F003: scsi_req_complete (scsi-bus.c:1383)
==17857== by 0x25022A: scsi_read_data (scsi-disk.c:334)
==17857== by 0x24EB9F: scsi_req_continue (scsi-bus.c:1289)
==17857== by 0x1C7787: lsi_do_dma (lsi53c895a.c:575)
==17857== by 0x1C8CDA: lsi_execute_script (lsi53c895a.c:1147)
==17857== by 0x1C74EA: lsi_resume_script (lsi53c895a.c:510)
==17857== by 0x1C7ECD: lsi_transfer_data (lsi53c895a.c:746)
==17857== by 0x24EC90: scsi_req_data (scsi-bus.c:1307)
(There are some more similar messages.)
This patch adds an assertion which also detects those errors:
Calling scsi_req_unref is not allowed when the previous call
of that function has decremented refcount to 0, because in this
case req was freed.
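The kind of assertion described can be sketched on a generic reference-counted object. The Request type and the request_* functions are hypothetical stand-ins, not the scsi-bus.c code. Reading a freed object is still undefined, so the check is best effort.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted request. */
typedef struct Request {
    unsigned refcount;
} Request;

static Request *request_new(void)
{
    Request *r = malloc(sizeof(*r));

    if (r) {
        r->refcount = 1;  /* the creator holds the first reference */
    }
    return r;
}

static Request *request_ref(Request *r)
{
    assert(r->refcount > 0);
    r->refcount++;
    return r;
}

static void request_unref(Request *r)
{
    /* A count of 0 means an earlier unref already freed the object. */
    assert(r->refcount > 0);
    if (--r->refcount == 0) {
        free(r);
    }
}
```

An unbalanced extra unref is then likely to trip the assertion at the call site instead of silently writing to freed memory.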
Paolo Bonzini [Thu, 3 May 2012 16:26:13 +0000 (18:26 +0200)]
scsi: remove useless debug messages
Optional inquiry information is declared obsolete in the latest versions
of the standard; invalid CDBs or unsupported VPD pages
can be diagnosed with trace_scsi_inquiry.
Paolo Bonzini [Thu, 3 May 2012 13:28:05 +0000 (15:28 +0200)]
scsi: do not report bogus overruns for commands in the 0x00-0x1F range
Interpreting cdb[4] == 0 as a request to transfer 256 blocks is only
needed for READ_6 and WRITE_6. No other command in that range needs
that special-casing, and the resulting overrun breaks scsi-testsuite's
attempt to use command 2 as a known-invalid command.
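A sketch of the length rule for 6-byte CDBs, assuming the standard opcodes READ(6) = 0x08 and WRITE(6) = 0x0a. The function name cdb6_transfer_blocks is hypothetical and this is not the scsi-bus.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define READ_6  0x08
#define WRITE_6 0x0a

/* Transfer length for a 6-byte CDB, taken from byte 4. */
static uint32_t cdb6_transfer_blocks(const uint8_t *cdb)
{
    uint32_t len;

    assert(cdb != NULL);
    len = cdb[4];
    /* Only READ(6) and WRITE(6) read a length of 0 as 256 blocks. */
    if (len == 0 && (cdb[0] == READ_6 || cdb[0] == WRITE_6)) {
        len = 256;
    }
    return len;
}
```

With this rule, command 2 with cdb[4] == 0 reports a zero-length transfer instead of a bogus 256-block one.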
Paolo Bonzini [Tue, 1 May 2012 08:25:16 +0000 (10:25 +0200)]
scsi-disk: add dpofua property
Linux expects REQ_FUA to be advertised only if WRITE+FUA is faster than
WRITE+SYNCHRONIZE CACHE, so we should not set the DPOFUA bit. However,
it is useful to have it for testing purposes, so add a qdev property to
set it.
Paolo Bonzini [Tue, 1 May 2012 08:23:54 +0000 (10:23 +0200)]
scsi: change "removable" field to host many features
It is pointless to add a uint32_t field for every new feature.
Since we will need a new feature soon, convert accesses to "removable"
to look at bit 0 only.
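A minimal sketch of a feature bitmask in which "removable" occupies bit 0. The macro names, the choice of bit 1 for a second feature, and the helper functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions inside one uint32_t "features" field. */
#define DISK_F_REMOVABLE 0
#define DISK_F_DPOFUA    1  /* hypothetical second feature */

static uint32_t feature_set(uint32_t features, unsigned bit)
{
    assert(bit < 32);
    return features | (UINT32_C(1) << bit);
}

static int feature_test(uint32_t features, unsigned bit)
{
    assert(bit < 32);
    return (int)((features >> bit) & 1u);
}
```

A field that used to hold removable = 1 keeps the same numeric value, so existing settings stay meaningful while further bits become available.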
Ronnie Sahlberg [Sat, 28 Apr 2012 13:49:36 +0000 (23:49 +1000)]
scsi: Specify the xfer direction for UNMAP and ATA_PASSTHROUGH commands
scsi_cmd_xfer_mode() is used to specify the xfer direction for SCSI
commands that come in from the guest. If the direction is set incorrectly
this will eventually cause QEMU to kernel-panic the guest.
Add UNMAP and ATA_PASSTHROUGH as commands that send data to the device.
Without this change, recent kernels will send both UNMAP and
ATA_PASSTHROUGH commands to any /dev/sg* device, which due to the
incorrect xfer direction very quickly causes the guest kernel to crash.
Example causing a crash without the patch applied:
Paolo Bonzini [Tue, 24 Apr 2012 06:41:04 +0000 (08:41 +0200)]
scsi: fix refcounting for reads
Recently introduced FUA support also gave us a use-after-free
of the BlockAcctCookie within a SCSIDiskReq, due to unbalanced
reference counting.
The patch fixes this by making scsi_do_read look like a combination
of scsi_*_complete + scsi_*_data. It does both a ref (like
scsi_read_data) and an unref (like scsi_flush_complete).
Ronnie Sahlberg [Tue, 24 Apr 2012 06:29:04 +0000 (16:29 +1000)]
ISCSI: Add support for thin-provisioning via discard/UNMAP and bigger LUNs
Update the configure test for libiscsi support to detect version 1.3
or later. Version 1.3 of libiscsi provides both READCAPACITY16 as well
as UNMAP commands.
Update the iscsi block layer to use READCAPACITY16 to detect the size of
the LUN instead of READCAPACITY10. This allows support for LUNs larger
than 2TB.
Update to implement bdrv_aio_discard() using the UNMAP command.
This allows us to use thin-provisioned LUNs from TGTD and other iSCSI
targets that support thin-provisioning.
Signed-off-by: Ronnie Sahlberg <[email protected]>
[squashed in subsequent patch from Ronnie to fix off-by-one in LBA count]
Signed-off-by: Paolo Bonzini <[email protected]>
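The 2TB limit follows from simple arithmetic, sketched here. READ CAPACITY reports the address of the last block, and the 10-byte variant does so in a 32-bit field; the 512-byte block size and the function name lun_size_bytes are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

/* LUN size in bytes from the last LBA and the block size. */
static uint64_t lun_size_bytes(uint64_t last_lba, uint32_t block_size)
{
    assert(block_size > 0);
    /* The number of blocks is one more than the last block address. */
    return (last_lba + 1) * block_size;
}
```

With a 32-bit last LBA and 512-byte blocks the result cannot exceed 2^32 * 512 bytes, which is 2 TiB; the 64-bit field of READCAPACITY16 removes that ceiling.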
Alexander Graf [Tue, 1 May 2012 15:30:28 +0000 (16:30 +0100)]
linux-user: fix emulation of /proc/self/maps
Improve the emulation of /proc/self/maps by reading the underlying
host maps file and passing lines through with addresses adjusted
to be guest addresses. This is necessary to avoid false triggers
of the glibc check that a format string containing '%n' is not in
writable memory. (For an example see the bug reported in
https://bugs.launchpad.net/qemu-linaro/+bug/947888 where gpg aborts.)
Alon Levy [Wed, 7 Mar 2012 14:19:03 +0000 (16:19 +0200)]
spice: require spice-protocol >= 0.8.1
Requiring spice-server >= 0.8.2 is not enough since spice-server.pc
doesn't require spice-protocol (any version). Until that is fixed
upstream an explicit requirement in qemu fixes compilation broken since
Stefan Weil [Fri, 27 Apr 2012 05:34:40 +0000 (05:34 +0000)]
qemu-timer: Fix limits for w32 mmtimer
timeSetEvent only accepts delays in the range which is returned by
timeGetDevCaps.
The lower limit is typically 1 (= 1 ms), so the constant value of 1
in the old code usually worked.
The upper limit can be as low as 10000 ms, so the latest changes in
QEMU's timer handling which introduced timeout values above that limit
could result in failures of timeSetEvent when the timer was re-armed.
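A portable sketch of clamping a delay to the supported range. On Windows the bounds would come from timeGetDevCaps (the wPeriodMin and wPeriodMax members of TIMECAPS); here they are plain parameters, and the function name clamp_timer_delay is hypothetical.

```c
#include <assert.h>

/* Clamp a delay in milliseconds to the range the timer accepts. */
static unsigned clamp_timer_delay(unsigned delay_ms,
                                  unsigned period_min,
                                  unsigned period_max)
{
    assert(period_min <= period_max);
    if (delay_ms < period_min) {
        return period_min;
    }
    if (delay_ms > period_max) {
        return period_max;
    }
    return delay_ms;
}
```

A timer clamped this way may fire earlier than requested for long timeouts, so the caller has to be prepared to re-arm it.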
Stefan Weil [Sat, 28 Apr 2012 05:07:47 +0000 (05:07 +0000)]
arm-semi: Rename SYS_XXX macros to TARGET_SYS_XXX (fixes compiler warning)
SYS_OPEN is already defined in stdio.h of MinGW-w64,
therefore the compiler complains when building for w64.
Adding the prefix TARGET_ avoids that macro redefinition.
xtensa-semi.c also uses the same prefix (but mixed case macros
TARGET_SYS_xxx instead of TARGET_SYS_XXX).