Peter Maydell [Tue, 20 Jan 2015 16:19:58 +0000 (16:19 +0000)]
Merge remote-tracking branch 'remotes/pmaydell/tags/pull-misc-20150120' into staging
Miscellaneous cross-tree patches:
* load/store helper cleanup
* drop TARGET_HAS_ICE define and checks
* scripts/qapi-types.py: Add dummy member to empty structs
* cpu_ldst.h: Don't define helpers if MMU_MODE*_SUFFIX not defined
# gpg: Signature made Tue 20 Jan 2015 15:43:38 GMT using RSA key ID 14360CDE
# gpg: Good signature from "Peter Maydell <[email protected]>"
* remotes/pmaydell/tags/pull-misc-20150120:
cpu_ldst.h: Don't define helpers if MMU_MODE*_SUFFIX not defined
cpu_ldst.h, cpu-all.h, bswap.h: Update documentation on ld/st accessors
cpu_ldst_template.h: Drop unused cpu_ldfq/stfq/ldfl/stfl accessors
cpu_ldst.h: Drop unused _raw macros, saddr() and laddr()
cpu_ldst_template.h: Use ld*_p directly rather than via ld*_raw macros
cpu_ldst.h: Use inline functions for usermode cpu_ld/st accessors
cpu_ldst.h: Remove unused very short ld*/st* defines
cpu_ldst.h: Drop unused ld/st*_kernel defines
target-mips: Don't use _raw load/store accessors
linux-user/main.c (m68k): Use get_user_u16 rather than lduw in cpu_loop
linux-user/vm86.c: Use cpu_ldl_data &c rather than plain ldl &c
bsd-user/elfload.c: Don't use ldl() or ldq_raw()
linux-user/elfload.c: Don't use _raw accessor functions
target-sparc: Don't use {ld, st}*_raw functions
monitor.c: Use ld*_p() instead of ld*_raw()
cpu_ldst.h: Remove unused ldul_ macros
exec.c: Drop TARGET_HAS_ICE define and checks
scripts/qapi-types.py: Add dummy member to empty structs
Peter Maydell [Tue, 20 Jan 2015 15:19:35 +0000 (15:19 +0000)]
cpu_ldst.h: Don't define helpers if MMU_MODE*_SUFFIX not defined
Not all targets define a full set of suffix strings for the
NB_MMU_MODES that they have. In this situation, don't define any
helper functions for that mode, rather than defining helper functions
with no suffix at all. The MMU mode is still functional; it is merely
not directly accessible via cpu_ld*_MODE from target helper functions.
Also add an "NB_MMU_MODES >= 2" check to the definition of the mode 1
helpers -- some targets only define one MMU mode.
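The guard described above can be sketched as follows. This is a minimal illustration, not the actual cpu_ldst.h code: the DEMO_* macros and the demo_ldub helper are hypothetical names, and the target configuration (two modes, only mode 0 has a suffix) is assumed for the example.

```c
#include <stdint.h>

/* Hypothetical target configuration: two MMU modes, but only mode 0
 * has a suffix string. */
#define NB_MMU_MODES 2
#define MMU_MODE0_SUFFIX _kernel
/* MMU_MODE1_SUFFIX is deliberately not defined. */

#define DEMO_GLUE_(a, b) a##b
#define DEMO_GLUE(a, b) DEMO_GLUE_(a, b)

#if NB_MMU_MODES >= 1 && defined(MMU_MODE0_SUFFIX)
/* Expands to demo_ldub_kernel(). */
static inline uint8_t DEMO_GLUE(demo_ldub, MMU_MODE0_SUFFIX)(const uint8_t *p)
{
    return *p;
}
#define DEMO_HAVE_MODE0_HELPER 1
#else
#define DEMO_HAVE_MODE0_HELPER 0
#endif

#if NB_MMU_MODES >= 2 && defined(MMU_MODE1_SUFFIX)
static inline uint8_t DEMO_GLUE(demo_ldub, MMU_MODE1_SUFFIX)(const uint8_t *p)
{
    return *p;
}
#define DEMO_HAVE_MODE1_HELPER 1
#else
/* No suffix: no helper is generated, rather than a suffix-less one. */
#define DEMO_HAVE_MODE1_HELPER 0
#endif
```

With this configuration only demo_ldub_kernel() exists; mode 1 gets no helper at all.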
Peter Maydell [Tue, 20 Jan 2015 15:19:35 +0000 (15:19 +0000)]
cpu_ldst.h, cpu-all.h, bswap.h: Update documentation on ld/st accessors
Add documentation of what the cpu_*_* accessors look like.
Correct some minor errors in the existing documentation of the
direct _p accessor family. Remove the near-duplicate comment
on the _p accessors from cpu-all.h and replace it with a reference
to the comment in bswap.h.
Peter Maydell [Tue, 20 Jan 2015 15:19:34 +0000 (15:19 +0000)]
cpu_ldst_template.h: Drop unused cpu_ldfq/stfq/ldfl/stfl accessors
The cpu_ldfq/stfq/ldfl/stfl accessors for loading and storing
float32 and float64 are completely unused, so delete them.
(The union they use for converting from the float32/float64
type to uint32_t or uint64_t is the wrong way to do it anyway:
they should be using make_float* and float*_val.)
Peter Maydell [Tue, 20 Jan 2015 15:19:34 +0000 (15:19 +0000)]
cpu_ldst_template.h: Use ld*_p directly rather than via ld*_raw macros
The ld*_raw and st*_raw macros are now only used within the code
produced by cpu_ldst_template.h, and only in three places.
Expand these out to just call the ld_p and st_p functions directly.
Note that in all the callsites the address argument is a uintptr_t,
so we can drop that part of the double-cast used in the saddr() and
laddr() macros.
Peter Maydell [Tue, 20 Jan 2015 15:19:34 +0000 (15:19 +0000)]
cpu_ldst.h: Use inline functions for usermode cpu_ld/st accessors
Use inline functions rather than macros for cpu_ld/st accessors
for the *-user configurations, as we already do for softmmu.
This has two advantages:
* we can actually typecheck our arguments
* we don't need to leak the _raw macros everywhere
Since the _kernel functions were only used by target-i386/seg_helper.c,
put the definitions for them in that file too. (It already has the
similar template include code to define them for the softmmu case,
so it makes sense to have it deal with defining them for user-only.)
Peter Maydell [Tue, 20 Jan 2015 15:19:33 +0000 (15:19 +0000)]
linux-user/main.c (m68k): Use get_user_u16 rather than lduw in cpu_loop
In the m68k cpu_loop() use get_user_u16 to read the immediate for
the simcall rather than lduw, to bring it into line with how other
archs do it and to remove another user of the ldl family of functions.
Peter Maydell [Tue, 20 Jan 2015 15:19:32 +0000 (15:19 +0000)]
target-sparc: Don't use {ld, st}*_raw functions
Instead of using the _raw family of ld/st accessor functions, use
cpu_*_data. All this code is CONFIG_USER_ONLY, so the two are the
same semantically, but the _raw functions are really a detail of
the implementation which has leaked into a few callsites like this one.
Peter Maydell [Tue, 20 Jan 2015 15:19:32 +0000 (15:19 +0000)]
monitor.c: Use ld*_p() instead of ld*_raw()
The monitor code for doing a memory_dump() was using ld*_raw() to do
target-CPU accesses out of a local buf[] array. The correct functions
for this purpose are ld*_p(), which take a host pointer, rather than
ld*_raw(), which take an integer representing a guest address and
are somewhat meaningless in softmmu configurations. Nobody noticed
because for softmmu the _raw functions are the same as ldl_p but
with some extra casts thrown in. Switch to using the correct functions
instead.
Peter Maydell [Tue, 20 Jan 2015 15:19:32 +0000 (15:19 +0000)]
cpu_ldst.h: Remove unused ldul_ macros
The five ldul_ macros are not used anywhere and are marked up with an XXX
comment. "ldul" is a non-standard prefix for our family of load instructions:
we don't mark 32-bit accesses for signedness because they return a 32 bit
quantity. So just delete them.
Peter Maydell [Tue, 20 Jan 2015 15:19:32 +0000 (15:19 +0000)]
exec.c: Drop TARGET_HAS_ICE define and checks
The TARGET_HAS_ICE #define is intended to indicate whether a target-*
guest CPU implementation supports the breakpoint handling. However,
all our guest CPUs have that support (the only two which do not
define TARGET_HAS_ICE are unicore32 and openrisc, and in both those
cases the bp support is present and the lack of the #define is just
a bug). So remove the #define entirely: all new guest CPU support
should include breakpoint handling as part of the basic implementation.
Peter Maydell [Tue, 20 Jan 2015 15:19:32 +0000 (15:19 +0000)]
scripts/qapi-types.py: Add dummy member to empty structs
Make sure that all generated C structs have at least one field; this
avoids potential issues with attempting to malloc space for
zero-length structs in C (g_malloc(sizeof struct) would return NULL).
It also avoids an incompatibility with C++ (where an empty struct is
size 1); that isn't important to us now but might be in future.
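A sketch of what such a generated type could look like is shown below. The type name, field name and constructor are made up for the example and are not claimed to match the generator's output.

```c
#include <stdlib.h>

/* Hypothetical generated type for a QAPI struct with no members.
 * The dummy field guarantees sizeof(DemoEmpty) >= 1 in both C and C++. */
typedef struct DemoEmpty {
    char demo_dummy_field_for_empty_struct;
} DemoEmpty;

/* Allocation now always requests a non-zero number of bytes. */
static DemoEmpty *demo_empty_new(void)
{
    return calloc(1, sizeof(DemoEmpty));
}
```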
Peter Maydell [Tue, 20 Jan 2015 14:34:38 +0000 (14:34 +0000)]
Merge remote-tracking branch 'remotes/sstabellini/xen-2015-01-20-v2' into staging
* remotes/sstabellini/xen-2015-01-20-v2:
xen: add a lock for the mapcache
xen: do not use __-named variables in mapcache
Xen: Use the ioreq-server API when available
Add device listener interface
Paolo Bonzini [Wed, 14 Jan 2015 10:20:56 +0000 (11:20 +0100)]
xen: add a lock for the mapcache
Extend the existing dummy mapcache_lock/unlock macros to cover all of
xen-mapcache.c. This prepares for unlocked memory access, when parts
of exec.c will not be protected by the BQL.
Paul Durrant [Tue, 20 Jan 2015 11:06:19 +0000 (11:06 +0000)]
Xen: Use the ioreq-server API when available
The ioreq-server API added to Xen 4.5 offers better security than
the existing Xen/QEMU interface because the shared pages that are
used to pass emulation request/results back and forth are removed
from the guest's memory space before any requests are serviced.
This prevents the guest from mapping these pages (they are in a
well known location) and attempting to attack QEMU by synthesizing
its own request structures. Hence, this patch modifies configure
to detect whether the API is available, and adds the necessary
code to use the API if it is.
Paul Durrant [Tue, 20 Jan 2015 11:05:07 +0000 (11:05 +0000)]
Add device listener interface
The Xen ioreq-server API, introduced in Xen 4.5, requires that PCI device
models explicitly register with Xen for config space accesses. This patch
adds a listener interface into qdev-core which can be used by the Xen
interface code to monitor for arrival and departure of PCI devices.
Peter Maydell [Mon, 19 Jan 2015 13:37:05 +0000 (13:37 +0000)]
Merge remote-tracking branch 'remotes/kraxel/tags/pull-console-20150119-1' into staging
ui: add shared surface format negotiation.
# gpg: Signature made Mon 19 Jan 2015 12:47:36 GMT using RSA key ID D3E87138
# gpg: Good signature from "Gerd Hoffmann (work) <[email protected]>"
# gpg: aka "Gerd Hoffmann <[email protected]>"
# gpg: aka "Gerd Hoffmann (private) <[email protected]>"
* remotes/kraxel/tags/pull-console-20150119-1:
ui/sdl2: Support shared surface for more pixman formats
ui/sdl: Support shared surface for more pixman formats
ui/gtk: Support shared surface for most pixman formats
ui/spice: Support shared surface for most pixman formats
ui/vnc: Support shared surface for most pixman formats
ui/pixman: add qemu_pixman_check_format
ui: Add dpy_gfx_check_format() to check backend shared surface support
ui: Make qemu_default_pixman_format() return 0 on unsupported formats
ui: Add dpy_gfx_check_format() to check backend shared surface support
This allows VGA to decide whether to use a shared surface based on
whether the UI backend supports the format or not. Backends that
don't provide the new callback fall back to native 32 bpp, which
is equivalent to what was supported before.
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
[ kraxel: fix console check, allow only 32 bpp as fallback ]
ui: Make qemu_default_pixman_format() return 0 on unsupported formats
In order to remove the logic for detecting supported shared
pixmap formats from device models, make qemu_default_pixman_format()
capable of failing by returning 0, which is not a possible format
value, rather than asserting.
Laszlo Ersek [Fri, 16 Jan 2015 11:54:30 +0000 (11:54 +0000)]
fw_cfg: fix endianness in fw_cfg_data_mem_read() / _write()
(1) Let's contemplate what device endianness means, for a memory mapped
device register (independently of QEMU -- that is, on physical hardware).
It determines the byte order that the device will put on the data bus when
the device is producing a *numerical value* for the CPU. This byte order
may differ from the CPU's own byte order, therefore when software wants to
consume the *numerical value*, it may have to swap the byte order first.
For example, suppose we have a device that exposes in a 2-byte register
the number of sheep we have to count before falling asleep. If the value
is decimal 37 (0x0025), then a big endian register will produce [0x00,
0x25], while a little endian register will produce [0x25, 0x00].
If the device register is big endian, but the CPU is little endian, the
numerical value will read as 0x2500 (decimal 9472), which software has to
byte swap before use.
However... if we ask the device about who stole our herd of sheep, and it
answers "XY", then the byte representation coming out of the register must
be [0x58, 0x59], regardless of the device register's endianness for
numeric values. And, software needs to copy these bytes into a string
field regardless of the CPU's own endianness.
(2) QEMU's device register accessor functions work with *numerical values*
exclusively, not strings:
The emulated register's read accessor function returns the numerical value
(eg. 37 decimal, 0x0025) as a *host-encoded* uint64_t. QEMU translates
this value for the guest to the endianness of the emulated device register
(which is recorded in MemoryRegionOps.endianness). Then guest code must
translate the numerical value from device register to guest CPU
endianness, before including it in any computation (see (1)).
(3) However, the data register of the fw_cfg device shall transfer strings
*only* -- that is, opaque blobs. Interpretation of any given blob is
subject to further agreement -- it can be an integer in an independently
determined byte order, or a genuine string, or an array of structs of
integers (in some byte order) and fixed size strings, and so on.
Because register emulation in QEMU is integer-preserving, not
string-preserving (see (2)), we have to jump through a few hoops.
(3a) We defined the memory mapped fw_cfg data register as
DEVICE_BIG_ENDIAN.
The particular choice is not really relevant -- we picked BE only for
consistency with the control register, which *does* transfer integers --
but our choice affects how we must host-encode values from fw_cfg strings.
(3b) Since we want the fw_cfg string "XY" to appear as the [0x58, 0x59]
array on the data register, *and* we picked DEVICE_BIG_ENDIAN, we must
compose the host (== C language) value 0x5859 in the read accessor
function.
(3c) When the guest performs the read access, the immediate uint16_t value
will be 0x5958 (in LE guests) and 0x5859 (in BE guests). However, the
uint16_t value does not matter. The only thing that matters is the byte
pattern [0x58, 0x59], which the guest code must copy into the target
string *without* any byte-swapping.
(4) Now I get to explain where I screwed up. :(
When we decided for big endian *integer* representation in the MMIO data
register -- see (3a) --, I mindlessly added an indiscriminate
byte-swizzling step to the (little endian) guest firmware.
This was a grave error -- it violates (3c) --, but I didn't realize it. I
only saw that the code I otherwise intended for fw_cfg_data_mem_read():
value = 0;
for (i = 0; i < size; ++i) {
value = (value << 8) | fw_cfg_read(s);
}
didn't produce the expected result in the guest.
In true facepalm style, instead of blaming my guest code (which violated
(3c)), I blamed my host code (which was correct). Ultimately, I coded
ldX_he_p() into fw_cfg_data_mem_read(), because that happened to work.
Obviously (...in retrospect) that was wrong. Only because my host happened
to be LE, ldX_he_p() composed the (otherwise incorrect) host value 0x5958
from the fw_cfg string "XY". And that happened to compensate for the bogus
indiscriminate byte-swizzling in my guest code.
Clearly the current code leaks the host endianness through to the guest,
which is wrong. Any device should work the same regardless of host
endianness.
The solution is to compose the host-endian representation (2) of the big
endian interpretation (3a, 3b) of the fw_cfg string, and to drop the wrong
byte-swizzling in the guest (3c).
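The "XY" example from (3b) can be checked with a small standalone sketch. The function names are hypothetical; demo_compose_be() follows the loop quoted in (4), and demo_store_be() models what a big endian register puts on the bus for a given host value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compose the host value whose big endian byte layout equals the blob. */
static uint64_t demo_compose_be(const uint8_t *blob, size_t size)
{
    uint64_t value = 0;

    assert(size <= sizeof(value));
    for (size_t i = 0; i < size; ++i) {
        value = (value << 8) | blob[i];
    }
    return value;
}

/* Byte sequence a big endian device register produces for a host value. */
static void demo_store_be(uint8_t *out, uint64_t value, size_t size)
{
    assert(size <= sizeof(value));
    for (size_t i = 0; i < size; ++i) {
        out[i] = (uint8_t)(value >> (8 * (size - 1 - i)));
    }
}
```

Composing "XY" gives 0x5859 on any host, and storing 0x5859 big endian gives back [0x58, 0x59], so the string survives the round trip independently of host endianness.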
Ard Biesheuvel [Fri, 16 Jan 2015 11:54:29 +0000 (11:54 +0000)]
target-arm: crypto: fix BE host support
The crypto emulation code in target-arm/crypto_helper.c never worked
correctly on big endian hosts, due to the fact that it uses a union
of array types to convert between the native VFP register size (64
bits) and the types used in the algorithms (bytes and 32 bit words).
We cannot just swab between LE and BE when reading and writing the
registers, as the SHA code performs word additions, so instead, add
array accessors for the CRYPTO_STATE type whose LE and BE specific
implementations ensure that the correct array elements are referenced.
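The idea of endian-aware array accessors can be sketched as below. This is an assumption-laden illustration: the union and function names are hypothetical, and host endianness is probed at run time here to keep the example self-contained, whereas a real implementation would more likely select the index mapping at compile time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit state viewed as bytes, words, or two 64-bit halves. */
union DemoCryptoState {
    uint8_t  bytes[16];
    uint32_t words[4];
    uint64_t l[2];
};

static int demo_host_is_big_endian(void)
{
    const uint16_t probe = 1;
    return *(const uint8_t *)&probe == 0;
}

/* Logical byte i, where byte 0 is the least significant byte of l[0]. */
static uint8_t *demo_st_byte(union DemoCryptoState *s, unsigned i)
{
    assert(i < 16);
    return &s->bytes[demo_host_is_big_endian() ? (15 - i) ^ 8 : i];
}

/* Logical word i, where word 0 is the least significant word of l[0]. */
static uint32_t *demo_st_word(union DemoCryptoState *s, unsigned i)
{
    assert(i < 4);
    return &s->words[demo_host_is_big_endian() ? (3 - i) ^ 2 : i];
}
```

Because only the index is remapped and the elements themselves stay in host order, word additions such as those in the SHA code keep working without any byte swapping.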
* remotes/amit-migration/tags/mig-2.3-1:
vmstate: type-check sub-arrays
migration_cancel: shutdown migration socket
Handle bi-directional communication for fd migration
socket shutdown
Tests: QEMUSizedBuffer/QEMUBuffer
QEMUSizedBuffer: only free qsb that qemu_bufopen allocated
xbzrle: rebuild the cache_is_cached function
xbzrle: optimize XBZRLE to decrease the cache misses
Cristian Klein [Thu, 8 Jan 2015 11:11:31 +0000 (11:11 +0000)]
Handle bi-directional communication for fd migration
libvirt prefers opening the TCP connection itself, for two reasons.
First, connection failures can be detected more easily, without having
to parse qemu's error output.
Second, libvirt might be asked to secure the transfer by tunnelling the
communication through a TLS layer.
Therefore, libvirt opens the TCP connection itself and passes an FD to qemu
using QMP and a POSIX-specific mechanism.
Hence, in order to make the reverse-path work in such cases, qemu needs to
distinguish if the transmitted FD is a socket (reverse-path available)
or not (reverse-path might not be available) and use the corresponding
abstraction.
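One common POSIX way to tell whether a passed-in FD is a socket is to ask for a socket-level option and see whether the call succeeds. The sketch below uses that approach; demo_fd_is_socket() is a hypothetical name and this is not claimed to be the check the patch itself uses.

```c
#include <stdbool.h>
#include <sys/socket.h>

/* Hypothetical helper: true if fd refers to a socket.
 * getsockopt() fails (ENOTSOCK or EBADF) for anything else. */
static bool demo_fd_is_socket(int fd)
{
    int type;
    socklen_t len = sizeof(type);

    return getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == 0;
}
```

A caller could then pick the socket-based abstraction when this returns true and a plain file abstraction, with no guaranteed reverse path, otherwise.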
Yang Hongyang [Fri, 19 Dec 2014 03:38:05 +0000 (11:38 +0800)]
QEMUSizedBuffer: only free qsb that qemu_bufopen allocated
Only free the qsb that qemu_bufopen allocated, and also allow
qemu_bufopen to accept a qsb as input for write operations. This
makes the API more logical:
1. If you create the QEMUSizedBuffer yourself, you need to
free it with qsb_free() rather than depending on another API
such as qemu_fclose.
2. Allow qemu_bufopen() to accept a QEMUSizedBuffer as input for
write operations; otherwise it is a little strange
that this API will not accept the second parameter.
This is an API change. There are only 3
users of this API currently, and the change only impacts the
first one, which will be fixed in patch 2 of this patchset,
so I think it is safe to make this change.
ChenLiang [Mon, 24 Nov 2014 11:55:47 +0000 (19:55 +0800)]
xbzrle: optimize XBZRLE to decrease the cache misses
Avoid hot pages being replaced by others, to remarkably decrease cache
misses.
Sample results with the test program quoted from xbzrle.txt, run in a
VM (migration bandwidth: 1GE, xbzrle cache size: 8MB):
the test program:
#include <stdlib.h>
#include <stdio.h>
int main()
{
char *buf = (char *) calloc(4096, 4096);
while (1) {
int i;
for (i = 0; i < 4096 * 4; i++) {
buf[i * 4096 / 4]++;
}
printf(".");
}
}
before this patch:
virsh qemu-monitor-command test_vm '{"execute": "query-migrate"}'
{"return":{"expected-downtime":1020,"xbzrle-cache":{"bytes":1108284,
"cache-size":8388608,"cache-miss-rate":0.987013,"pages":18297,"overflow":8,
"cache-miss":1228737},"status":"active","setup-time":10,"total-time":52398,
"ram":{"total":12466991104,"remaining":1695744,"mbps":935.559472,
"transferred":5780760580,"dirty-sync-counter":271,"duplicate":2878530,
"dirty-pages-rate":29130,"skipped":0,"normal-bytes":5748592640,
"normal":1403465}},"id":"libvirt-706"}
18k pages sent compressed in 52 seconds.
The cache-miss-rate is 98.7%: almost every lookup misses.
Peter Maydell [Thu, 15 Jan 2015 10:08:46 +0000 (10:08 +0000)]
Merge remote-tracking branch 'remotes/mjt/tags/pull-trivial-patches-2015-01-15' into staging
trivial patches for 2015-01-15
# gpg: Signature made Thu 15 Jan 2015 08:26:26 GMT using RSA key ID A4C3D7DB
# gpg: Good signature from "Michael Tokarev <[email protected]>"
# gpg: aka "Michael Tokarev <[email protected]>"
# gpg: aka "Michael Tokarev <[email protected]>"
* remotes/mjt/tags/pull-trivial-patches-2015-01-15:
vl.c: fix some alignment issues
blizzard: do not depend on VGA internals
Makefile: Remove config.status and common.env during 'make distclean'
target-openrisc: bugfix for dec_sys to decode instructions correctly
Do not hang on full PTY
misc: Fix new typos in comments
target-arm: Fix typo in comment (seperately -> separately)
target-tricore: Fix new typos
migration/qemu-file.c: Don't shift left into sign bit
translate-all: Mark map_exec() with the 'unused' attribute
tests/hd-geo-test.c: Remove unused test_image variable
vt82c686: avoid out-of-bounds read
David Morrison [Tue, 6 Jan 2015 17:06:18 +0000 (09:06 -0800)]
target-openrisc: bugfix for dec_sys to decode instructions correctly
Fixed the decoding of "system" instructions (starting with 0x2)
in dec_sys() in translate.c. In particular, the l.trap instruction
is now correctly decoded, which allows single-stepping and setting
breakpoints in GDB.
SeokYeon Hwang [Tue, 23 Dec 2014 22:26:54 +0000 (22:26 +0000)]
translate-all: Mark map_exec() with the 'unused' attribute
Mark map_exec() with the 'unused' attribute to avoid '-Wunused-function'
warnings on clang 3.4 or later. This means we don't need to mark it
'inline', which is what we were previously using to suppress the warning
(a trick which only works with gcc, not clang).
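As an illustration of the attribute, here is a hypothetical static helper that may go uncalled in some build configurations. The function is made up for the example and is not map_exec() itself.

```c
/* Hypothetical helper: marked 'unused' so that neither gcc nor clang
 * warns with -Wunused-function when no caller is compiled in. */
static __attribute__((unused)) int demo_round_up_to_page(int size)
{
    return (size + 4095) & ~4095;
}
```

The attribute only silences the warning; the function stays an ordinary static function and can still be called where it is needed.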
Paolo Bonzini [Wed, 10 Dec 2014 09:17:36 +0000 (10:17 +0100)]
vt82c686: avoid out-of-bounds read
superio_ioport_readb can read the 256th element of the array.
Coverity reports an out-of-bounds write in superio_ioport_writeb,
but it does not show the corresponding out-of-bounds read
because it cannot prove that it can happen. Fix the root
cause of the problem (zhanghailang's patch instead fixes
the logic in superio_ioport_writeb).
Peter Maydell [Wed, 14 Jan 2015 18:02:47 +0000 (18:02 +0000)]
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
Mostly bugfixes and cleanups from qemu-devel. Yet another small patch from
the record/replay series, and a few SCSI and i386 patches as well.
# gpg: Signature made Wed 14 Jan 2015 09:39:14 GMT using RSA key ID 78C7AE83
# gpg: Good signature from "Paolo Bonzini <[email protected]>"
# gpg: aka "Paolo Bonzini <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
# Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83
* remotes/bonzini/tags/for-upstream:
cpus: consistently use QEMU_CLOCK_VIRTUAL_RT for icount_warp_rt timer
qemu-timer: rename timer_init to timer_init_tl
scsi: fix cancellation when I/O was completed but DMA was not.
rules.mak: Fix module build
hw/scsi/lsi53c895a: add support for additional diag / debug registers
qemu-common.h: optimise muldiv64 if int128 is available
target-i386: do not memcpy in and out of xmm_regs
target-i386: fix movntsd on big-endian hosts
vl.c: fix regression when reading memory size from config file
vl: Don't silently change topology when all -smp options were set
vl: fix max_cpus check
vl: Avoid unnecessary 'if' nesting
9pfs: changed to use event_notifier instead of qemu_pipe
vl.c: fix regression when reading machine type from config file
char: restore stdio echo on resume from suspend.
Paolo Bonzini [Mon, 12 Jan 2015 10:47:30 +0000 (11:47 +0100)]
scsi: fix cancellation when I/O was completed but DMA was not.
Commit d577646 (scsi: Introduce scsi_req_cancel_complete, 2014-09-25)
was supposed to have no semantic change, but it missed a case. When
r->aiocb has already been NULLed, but DMA was not complete and the
SCSI layer was waiting for scsi_req_continue, after the patch the
SCSI layer will not call the .cancel callback of SCSIBusInfo.
Fam Zheng [Mon, 12 Jan 2015 04:43:09 +0000 (12:43 +0800)]
rules.mak: Fix module build
Module build has been broken since commit c261d774fb (rules.mak: Fix DSO
build by pulling in archive symbols). That commit added .mo placeholders
of DSO to -y variables, in order to pull stub symbols to executable. But
the placeholders are unintentionally expanded in -y, rather than
filtered out while linking.
Fix it by moving the -objs expansion to before the insertion of .mo
placeholders. Note that passing -cflags and -libs to member objects is
also moved, so that it still happens before object expansion.
Peter Lieven [Mon, 12 Jan 2015 09:45:17 +0000 (10:45 +0100)]
hw/scsi/lsi53c895a: add support for additional diag / debug registers
Some ancient Linux kernels read from registers 0x09 and 0x3c-3f during
boot. According to the spec these registers are for diag and debug
purposes only. If they are absent, qemu aborts on read.
Paolo Bonzini [Fri, 24 Oct 2014 07:44:38 +0000 (09:44 +0200)]
target-i386: do not memcpy in and out of xmm_regs
After the next patch, we will move the high parts of AVX and AVX512 registers
into the same array as the SSE registers. This will make it impossible to
memcpy an array of 128-bit values in and out of xmm_regs in one swoop.
Use a for loop instead.
Similarly, always use XMM_Q in translate.c. This avoids introducing bugs
such as the one fixed in the previous patch.
xen-hvm: increase maxmem before calling xc_domain_populate_physmap
Increase maxmem before calling xc_domain_populate_physmap_exact to
avoid the risk of running out of guest memory. This way we can also
avoid complex memory calculations in libxl at domain construction
time.
This patch fixes an abort() when assigning more than 4 NICs to a VM.
Peter Maydell [Tue, 13 Jan 2015 13:49:18 +0000 (13:49 +0000)]
Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging
# gpg: Signature made Tue 13 Jan 2015 13:48:06 GMT using RSA key ID 81AB73C8
# gpg: Good signature from "Stefan Hajnoczi <[email protected]>"
# gpg: aka "Stefan Hajnoczi <[email protected]>"
* remotes/stefanha/tags/block-pull-request: (38 commits)
NVMe: Set correct VS Value for 1.1 Compliant Controllers
MAINTAINERS: Add migration/block* to block subsystem
MAINTAINERS: Update email addresses for Chrysostomos Nanakos
nvme: Fix get/set number of queues feature
ide: Implement VPD response for ATAPI
block: Split BLOCK_OP_TYPE_COMMIT to BLOCK_OP_TYPE_COMMIT_{SOURCE, TARGET}
block: limited request size in write zeroes unsupported path
coroutine: try harder not to delete coroutines
coroutine: drop qemu_coroutine_adjust_pool_size
coroutine: rewrite pool to avoid mutex
QSLIST: add lock-free operations
test-coroutine: avoid overflow on 32-bit systems
qemu-thread: add per-thread atexit functions
coroutine-ucontext: use __thread
qemu-iotests: Add supported os parameter for python tests
qemu-iotests: Add "_supported_os Linux" to 058
qemu-iotests: Replace "/bin/true" with "true"
.gitignore: Ignore generated "common.env"
libqos: Convert malloc-pc allocator to a generic allocator
migration/block: fix pending() return value
...
Alex Friedman [Fri, 5 Dec 2014 12:40:24 +0000 (14:40 +0200)]
nvme: Fix get/set number of queues feature
According to the specification, the low 16 bits should contain the number of
I/O submission queues, and the high 16 bits should contain the number of
I/O completion queues.
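The field layout described above can be sketched as follows. The function names are hypothetical, and the sketch deliberately says nothing about how the field values relate to actual queue counts; it only shows the low-16/high-16 packing.

```c
#include <stdint.h>

/* Pack the two 16-bit fields: submission queues in the low half,
 * completion queues in the high half. */
static uint32_t demo_pack_num_queues(uint16_t nsq_field, uint16_t ncq_field)
{
    return (uint32_t)nsq_field | ((uint32_t)ncq_field << 16);
}

/* Extract the I/O submission queue field (low 16 bits). */
static uint16_t demo_num_sq_field(uint32_t dw)
{
    return (uint16_t)(dw & 0xffff);
}

/* Extract the I/O completion queue field (high 16 bits). */
static uint16_t demo_num_cq_field(uint32_t dw)
{
    return (uint16_t)(dw >> 16);
}
```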
John Snow [Wed, 10 Dec 2014 18:17:07 +0000 (13:17 -0500)]
ide: Implement VPD response for ATAPI
SCSI devices have multiple kinds of queries they need to respond
to, as defined in the "cmd inquiry" section in MMC-6 and SPC-3.
Relevant sections:
MMC-6 revision 2g:
Non-VPD response data and pointer to SPC-3;
Section 6.8 "Inquiry Command"
SPC-3 revision 23:
Inquiry command and error handling:
Section 6.4 "INQUIRY command"
VPD data pages format:
Section 7.6 "Vital product data parameters"
We implement these Vital Product Data queries for SCSI, but not for
ATAPI through IDE. The result is that if you are looking for the WWN
identifier via tools such as sg3_utils, you will be unable to query
our CD/DVD rom device to obtain it.
This patch adds the minimum number of mandatory responses as defined
by SPC-3, which include the "supported pages" response (page 0x00)
and the "Device Identification" response (page 0x83). It also correctly
responds when it receives a request for an illegal page to improve
error output from related tools.
The Device ID page contains an arbitrary list of identification
strings of various formats; the ID strings included in this patch
were chosen to mimic those provided by the libata driver when
emulating this SCSI query (model, serial, and wwn when present.)
Example:
# libata emulated response
[root@localhost ~]# sg_inq --id /dev/sda
VPD INQUIRY: Device Identification page
Designation descriptor number 1, descriptor length: 24
designator_type: vendor specific [0x0], code_set: ASCII
associated with the addressed logical unit
vendor specific: QM00001
Designation descriptor number 2, descriptor length: 72
designator_type: T10 vendor identification, code_set: ASCII
associated with the addressed logical unit
vendor id: ATA
vendor specific: QEMU HARDDISK QM00001
# QEMU generated ATAPI response, with WWN
[root@localhost ~]# sg_inq --id /dev/sr0
VPD INQUIRY: Device Identification page
Designation descriptor number 1, descriptor length: 24
designator_type: vendor specific [0x0], code_set: ASCII
associated with the addressed logical unit
vendor specific: QM00005
Designation descriptor number 2, descriptor length: 72
designator_type: T10 vendor identification, code_set: ASCII
associated with the addressed logical unit
vendor id: ATA
vendor specific: QEMU DVD-ROM QM00005
Designation descriptor number 3, descriptor length: 12
designator_type: NAA, code_set: Binary
associated with the addressed logical unit
NAA 5, IEEE Company_id: 0xc50
Vendor Specific Identifier: 0x15ea71bb
[0x5000c50015ea71bb]
See also: hw/scsi/scsi-disk.c, scsi_disk_emulate_inquiry()
block: Split BLOCK_OP_TYPE_COMMIT to BLOCK_OP_TYPE_COMMIT_{SOURCE, TARGET}
Like BLOCK_OP_TYPE_BACKUP_SOURCE and BLOCK_OP_TYPE_BACKUP_TARGET,
block-commit involves two asymmetric devices.
This change is not user-visible (yet), because commit only works with
device names.
But once we enable backing reference in blockdev-add, or specifying
node-name in block-commit command, we don't want the user to start two
commit jobs on the same backing chain, which will corrupt things because
of the final bdrv_swap.
Before we have per category blockers, splitting this type is still
better.
[Resolved virtio-blk dataplane conflict by replacing
BLOCK_OP_TYPE_COMMIT with both BLOCK_OP_TYPE_COMMIT_{SOURCE, TARGET}.
They are safe since the block job runs in the same AioContext as the
dataplane IOThread.
--Stefan]
Peter Lieven [Mon, 5 Jan 2015 11:29:49 +0000 (12:29 +0100)]
block: limited request size in write zeroes unsupported path
If bs->bl.max_write_zeroes is large and we end up in the unsupported
path we might allocate a lot of memory for the iovector and/or even
generate oversized requests.
Fix this by limiting the request by the minimum of the reported
maximum transfer size or 16MB (32768 sectors).
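The clamping rule can be sketched like this. The names are hypothetical, and treating a reported maximum transfer size of 0 as "no limit reported" is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>

#define DEMO_SECTOR_SIZE        512
#define DEMO_MAX_BOUNCE_SECTORS 32768   /* 16 MB / 512 bytes per sector */

/* Limit one write-zeroes fallback request to the smaller of the reported
 * maximum transfer size and 16 MB. max_transfer == 0 is assumed to mean
 * that the driver reported no limit. */
static int demo_clamp_write_zeroes(int nb_sectors, int max_transfer)
{
    int limit = DEMO_MAX_BOUNCE_SECTORS;

    assert(nb_sectors >= 0);
    if (max_transfer > 0 && max_transfer < limit) {
        limit = max_transfer;
    }
    return nb_sectors < limit ? nb_sectors : limit;
}
```

A caller would loop, issuing requests of at most the returned size until the whole range has been written.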
Peter Lieven [Tue, 2 Dec 2014 11:05:50 +0000 (12:05 +0100)]
coroutine: try harder not to delete coroutines
Placing coroutines on the global pool should be preferable, because it
can help all threads. But if the global pool is full, we can still
try to save some allocations by stashing completed coroutines on the
local pool. This is quite cheap too, because it does not require
atomic operations, and provides a gain of 15% in the best case.
Paolo Bonzini [Tue, 2 Dec 2014 11:05:48 +0000 (12:05 +0100)]
coroutine: rewrite pool to avoid mutex
This patch removes the mutex by using fancy lock-free manipulation of
the pool. Lock-free stacks and queues are not hard, but they can suffer
from the ABA problem so they are better avoided unless you have some
deferred reclamation scheme like RCU. Otherwise you have to stick
with adding to a list, and emptying it completely. This is what this
patch does, by coupling a lock-free global list of available coroutines
with per-CPU lists that are actually used on coroutine creation.
Whenever the destruction pool is big enough, the next thread that runs
out of coroutines will steal the whole destruction pool. This is positive
in two ways:
1) the allocation does not have to do any atomic operation in the fast
path, it's entirely using thread-local storage. Once every POOL_BATCH_SIZE
allocations it will do a single atomic_xchg. Release does an atomic_cmpxchg
loop, that hopefully doesn't cause any starvation, and an atomic_inc.
A later patch will also remove atomic operations from the release path,
and try to avoid the atomic_xchg altogether---succeeding in doing so if
all devices either use ioeventfd or are not submitting requests actively.
2) in theory this should be completely adaptive. The number of coroutines
around should be a little more than POOL_BATCH_SIZE * number of allocating
threads; so this also empties qemu_coroutine_adjust_pool_size. (The previous
pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit
more generous. But if you actually have many high-iodepth disks, it's better
to put them in different iothreads, which will also use separate thread
pools and aio=native file descriptors).
This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33.
Whether or not we end up with some kind of coroutine bypass scheme,
it cannot hurt to optimize hot code.
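A minimal sketch of the scheme described above, written with C11 atomics
rather than QEMU's own atomic_* helpers. The names co_get, co_put,
release_pool and alloc_pool, and the value chosen for POOL_BATCH_SIZE, are
assumptions made for illustration and are not taken from the patch. The
size counter is only approximate under concurrency, which is acceptable for
a heuristic threshold.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_BATCH_SIZE 64u              /* assumed value, for illustration */

struct co {                              /* hypothetical stand-in for a coroutine */
    struct co *next;
};

static _Atomic(struct co *) release_pool;    /* global list, lock-free */
static atomic_uint release_pool_size;        /* approximate element count */
static _Thread_local struct co *alloc_pool;  /* per-thread list, no atomics */

/* Allocation: the fast path touches only thread-local storage. */
static struct co *co_get(void)
{
    if (!alloc_pool && atomic_load(&release_pool_size) > POOL_BATCH_SIZE) {
        /* Slow path: steal the whole global list with a single exchange. */
        atomic_store(&release_pool_size, 0);
        alloc_pool = atomic_exchange(&release_pool, (struct co *)NULL);
    }
    struct co *c = alloc_pool;
    if (c) {
        alloc_pool = c->next;
        return c;
    }
    return calloc(1, sizeof(*c));
}

/* Release: compare-and-swap loop to push, then one atomic increment. */
static void co_put(struct co *c)
{
    assert(c != NULL);
    if (atomic_load(&release_pool_size) < 2 * POOL_BATCH_SIZE) {
        struct co *old = atomic_load(&release_pool);
        do {
            c->next = old;
        } while (!atomic_compare_exchange_weak(&release_pool, &old, c));
        atomic_fetch_add(&release_pool_size, 1);
        return;
    }
    free(c);
}
```

Because a consumer only ever takes the complete list, no thread pops a
single element from the shared head, which is the operation that would
expose the ABA problem.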
Paolo Bonzini [Tue, 2 Dec 2014 11:05:47 +0000 (12:05 +0100)]
QSLIST: add lock-free operations
These operations are trivial to implement and do not have ABA problems.
They are enough to implement simple multiple-producer, single-consumer
lock-free lists or, as in the next patch, the multiple consumers can
steal a whole batch of elements and process them at their leisure.
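The two operations can be sketched as follows with C11 atomics. This is an
illustration only: struct node, struct lf_list, lf_push and lf_steal_all
are hypothetical names and do not reproduce the QSLIST macros added by the
patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node and list types, not QEMU's QSLIST macros. */
struct node {
    struct node *next;
    int value;
};

struct lf_list {
    _Atomic(struct node *) head;
};

/* Insert at head: retry the compare-and-swap until the head is unchanged. */
static void lf_push(struct lf_list *l, struct node *n)
{
    assert(n != NULL);
    struct node *old = atomic_load(&l->head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&l->head, &old, n));
}

/* Move: detach the whole chain in one exchange; the caller then owns it. */
static struct node *lf_steal_all(struct lf_list *l)
{
    return atomic_exchange(&l->head, (struct node *)NULL);
}

/* Walk a stolen chain; safe without atomics because it is now private. */
static int chain_length(const struct node *n)
{
    int len = 0;
    for (; n; n = n->next) {
        len++;
    }
    return len;
}
```

Neither operation removes a single element from a shared head, so a
recycled node can never be mistaken for the old head.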
Paolo Bonzini [Tue, 2 Dec 2014 11:05:45 +0000 (12:05 +0100)]
qemu-thread: add per-thread atexit functions
Destructors are the main additional feature of pthread TLS compared
to __thread. If we were using C++ (hint, hint!) we could have used
thread-local objects with a destructor. Since we are not, instead,
we add a simple Notifier-based API.
Note that the notifier must be per-thread as well. We can add a
global list as well later, perhaps.
The Win32 implementation has some complications because a) detached
threads used not to have a QemuThreadData; b) the main thread does
not go through win32_start_routine, so we have to use atexit too.
Paolo Bonzini [Tue, 2 Dec 2014 11:05:44 +0000 (12:05 +0100)]
coroutine-ucontext: use __thread
ELF thread local storage is about 10% faster on tests/test-coroutine's
perf/cost test. The timing on my machine is 190ns per iteration with
pthread TLS, 170 with ELF TLS.
Based on a patch by Kevin Wolf and Peter Lieven, but redone to follow
the model of coroutine-win32.c (including the important "noinline"
attribute!).
Platforms without thread-local storage (OpenBSD probably?) will need
a new-enough GCC for this to compile, in order to use the same emutls
support that Windows already relies on.
Fam Zheng [Sun, 4 Jan 2015 01:53:52 +0000 (09:53 +0800)]
qemu-iotests: Add supported os parameter for python tests
If I understand correctly, qemu-iotests was never meant to be portable.
We only support Linux for all the shell cases, but didn't specify it for
the python tests. Now add this and default all the python tests to Linux
only. If we care enough later, we can override the parameter in
individual cases.
Liang Li [Tue, 13 Jan 2015 02:40:53 +0000 (10:40 +0800)]
xen-pt: Fix PCI devices re-attach failed
Use the 'xl pci-attach $DomU $BDF' command to attach more than
one PCI device to the guest, then detach the devices with
'xl pci-detach $DomU $BDF'. Re-attaching these PCI devices
afterwards fails with an error message like the following:
libxl: error: libxl_qmp.c:287:qmp_handle_error_response: receive
an error message from QMP server: Duplicate ID 'pci-pt-03_10.1'
for device.
If 'address_space_memory' is passed as the parameter of
'memory_listener_register', 'xen_pt_region_del' will not be called
when the device is detached if the memory region's name is not
'xen-pci-pt-*'. As a result, the device's QemuOpts object is not
released properly.
Using the device's address space avoids this issue: the number of
calls to 'xen_pt_region_add' when attaching equals the number of
calls to 'xen_pt_region_del' when detaching, so every memory region
referenced by 'xen_pt_region_add' is unreferenced by
'xen_pt_region_del' and can be released properly.
Marc Marí [Thu, 23 Oct 2014 08:12:42 +0000 (10:12 +0200)]
libqos: Convert malloc-pc allocator to a generic allocator
The allocator in malloc-pc has been extracted, so it can be used in every arch.
This operation showed that both the alloc and free functions can also be
generic.
Because of this, the QGuestAllocator no longer wraps the alloc and free
functions, and now just contains the allocator parameters.
As a result, only the allocator initializer and uninitializer are arch dependent.
Because of a wrong return value from .save_live_pending() in
migration/block.c, migration finishes before the whole disk is
transferred. This situation occurs when the migration process is fast
enough, for example when source and dest are on the same host.
If in the bulk phase we return something < max_size, we will skip
transferring the tail of the device. Currently we have "set pending to
BLOCK_SIZE if it is zero" for the bulk phase, but there is no guarantee
that it will not be < max_size.
The correct approach is to return, for example, max_size+1 while we are
in the bulk phase.
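The reasoning can be modelled in a few lines of C. Everything here is a
simplified assumption made for illustration: migration_can_complete,
pending_old and pending_fixed are hypothetical names, the BLOCK_SIZE value
is assumed, and the completion rule "pending < max_size" is taken from the
description above rather than from the migration code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE (UINT64_C(1) << 20)   /* assumed value, for illustration */

/* Model: migration enters its final stage once pending drops below max_size. */
static bool migration_can_complete(uint64_t pending, uint64_t max_size)
{
    return pending < max_size;
}

/* Old rule: only a zero result is bumped, and only up to BLOCK_SIZE. */
static uint64_t pending_old(bool bulk_done, uint64_t dirty_bytes)
{
    uint64_t pending = dirty_bytes;
    if (pending == 0 && !bulk_done) {
        pending = BLOCK_SIZE;
    }
    return pending;
}

/* Fixed rule: during the bulk phase never report less than max_size + 1. */
static uint64_t pending_fixed(bool bulk_done, uint64_t dirty_bytes,
                              uint64_t max_size)
{
    assert(max_size < UINT64_MAX);       /* keep max_size + 1 from wrapping */
    uint64_t pending = dirty_bytes;
    if (pending <= max_size && !bulk_done) {
        pending = max_size + 1;
    }
    return pending;
}
```

With max_size larger than BLOCK_SIZE, the old rule lets the model complete
while the bulk phase is still running; the fixed rule does not.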
block: fix spoiling all dirty bitmaps by mirror and migration
Mirror and migration use dirty bitmaps for their purposes, and since
commit [block: per caller dirty bitmap] they use their own bitmaps, not
the global one. But they still use the old functions bdrv_set_dirty and
bdrv_reset_dirty, which change all dirty bitmaps.
The named dirty bitmaps series by Fam and Snow are affected: mirroring
and migration will spoil all named dirty bitmaps, including those not
related to this mirroring or migration.
This patch fixes this by adding bdrv_set_dirty_bitmap and
bdrv_reset_dirty_bitmap, which change only the specified bitmap. Also,
to prevent such mistakes in the future, the old functions
bdrv_(set,reset)_dirty are made static, for internal block usage.
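The difference between the two kinds of function can be sketched as below.
This is a toy model, not the block layer's code: struct dirty_bitmap,
struct device, set_dirty_all and set_dirty_one are hypothetical names, and
each bitmap is reduced to a single 64-bit word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy types: one 64-bit word per bitmap, bitmaps in a list. */
struct dirty_bitmap {
    uint64_t bits;
    struct dirty_bitmap *next;
};

struct device {
    struct dirty_bitmap *bitmaps;
};

static void mark_range(uint64_t *bits, unsigned first, unsigned count)
{
    assert(first + count <= 64);
    for (unsigned i = first; i < first + count; i++) {
        *bits |= UINT64_C(1) << i;
    }
}

/* Old style: a write marks the range dirty in every bitmap on the device. */
static void set_dirty_all(struct device *d, unsigned first, unsigned count)
{
    for (struct dirty_bitmap *b = d->bitmaps; b; b = b->next) {
        mark_range(&b->bits, first, count);
    }
}

/* New style: the caller names the one bitmap it owns; others are untouched. */
static void set_dirty_one(struct dirty_bitmap *b, unsigned first,
                          unsigned count)
{
    mark_range(&b->bits, first, count);
}
```

A job that tracks its own progress would call the per-bitmap variant, while
the device-wide variant stays reserved for real guest writes.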