This is a simple reconnect test that checks whether vhost-user
reconnection is possible and restores the state. A more complete test
would actually manipulate and check the ring contents (such extended
testing would benefit from the libvhost-user proposed on the QEMU list to
avoid duplication of ring manipulations).
A driver may change the vring enable state at run time, but the vhost-user
backend may not be present (a contrived example is when the backend is
disconnected and the device is reconfigured after driver rebinding).
Restore the vring state when the vhost-user backend is started, so it
can process the ring.
Do not crash when backend is not present while enabling the ring. A
following patch will save the enabled state so it can be restored once
the backend is started.
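The save-and-restore behaviour described in the three notes above can be sketched as follows. This is a minimal illustration under assumptions: the struct, the field names and all `demo_*` functions are hypothetical and are not QEMU's actual vhost-user code. It only shows the idea of recording the requested enable state while the backend is absent and replaying it once the backend starts.

```c
#include <stdbool.h>

#define DEMO_MAX_QUEUES 4

/* Hypothetical device state; field names are illustrative only. */
struct demo_vhost_dev {
    bool backend_started;
    bool vring_enable[DEMO_MAX_QUEUES];  /* state requested by the driver */
    int msgs_sent;                       /* messages forwarded to the backend */
};

/* Stand-in for sending the enable message to the backend. */
static void demo_send_vring_enable(struct demo_vhost_dev *dev, int idx, bool on)
{
    (void)idx;
    (void)on;
    dev->msgs_sent++;
}

/* Always remember the request; only talk to the backend if it exists. */
static int demo_set_vring_enable(struct demo_vhost_dev *dev, int idx, bool on)
{
    if (idx < 0 || idx >= DEMO_MAX_QUEUES) {
        return -1;
    }
    dev->vring_enable[idx] = on;
    if (dev->backend_started) {
        demo_send_vring_enable(dev, idx, on);
    }
    return 0;
}

/* On (re)connect, replay the saved state so the backend can process the rings. */
static void demo_backend_start(struct demo_vhost_dev *dev)
{
    int i;

    dev->backend_started = true;
    for (i = 0; i < DEMO_MAX_QUEUES; i++) {
        demo_send_vring_enable(dev, i, dev->vring_enable[i]);
    }
}
```

Keeping the saved state in the device rather than in the backend is what lets a late or restarted backend pick up the driver's last request.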
If the backend failed to start (for example because feature negotiation failed),
do not exit, but disconnect the char device instead. This is slightly more
robust for the reconnect case.
Tetsuya Mukawa [Mon, 6 Jun 2016 16:45:02 +0000 (18:45 +0200)]
qemu-char: add qemu_chr_disconnect to close a fd accepted by listen fd
The patch introduces qemu_chr_disconnect(). The function is used for
closing an fd accepted by a listen fd. We already have qemu_chr_delete(),
but it closes not only the accepted fd but also the listen fd. The new function
is used when we still want to keep the listen fd.
tests/vhost-user-bridge: workaround stale vring base
This patch is a similar solution to what Yuanhan Liu/Huawei Xie have
suggested for DPDK. When vubr quits (killed or crashed), a restarted
vubr would get a stale vring base from QEMU. That would break the kernel
virtio net completely, so that it no longer works unless a driver
reset is done.
So, instead of getting the stale vring base from QEMU, Huawei suggested
we could get a proper one from used->idx. This works because the queue's
packets are processed in order.
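The workaround can be sketched as follows. This is only an illustration under assumptions: the structs and `demo_restore_vring_base()` are hypothetical names, not vubr's actual code. The one idea taken from the text is that the base is read from used->idx rather than from the value QEMU reports.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down view of the used ring header. */
struct demo_vring_used {
    uint16_t flags;
    uint16_t idx;      /* next slot the backend will fill */
};

/* Hypothetical per-queue backend state. */
struct demo_vq {
    const struct demo_vring_used *used;
    uint16_t last_avail_idx;
    uint16_t last_used_idx;
};

/*
 * Ignore the (possibly stale) base reported by QEMU and resume from
 * used->idx instead. This is valid only when buffers are completed in
 * order, so that everything before used->idx has been consumed.
 */
static void demo_restore_vring_base(struct demo_vq *vq, uint16_t stale_base)
{
    (void)stale_base;
    vq->last_avail_idx = vq->used->idx;
    vq->last_used_idx = vq->used->idx;
}
```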
Tetsuya Mukawa [Mon, 6 Jun 2016 16:44:59 +0000 (18:44 +0200)]
vhost-user: add ability to know vhost-user backend disconnection
Current QEMU cannot detect vhost-user backend disconnection. This
patch adds the ability to detect it.
To detect disconnection, add a watcher for the G_IO_HUP event. When a
G_IO_HUP event is detected, the disconnected socket will be read
to cause a CHR_EVENT_CLOSED.
Peter Xu [Tue, 17 May 2016 11:26:10 +0000 (19:26 +0800)]
pci: fix pci_requester_id()
This fixes the SID verification failure when IOMMU IR is enabled with PCI
bridges. The existing pci_requester_id() really only returns the BDF, so
rename it to pci_get_bdf(). Meanwhile, provide a correct implementation
for getting the requester ID. VT-d spec 5.1.1 is a good reference;
although it talks only about interrupt delivery, the rule works
exactly the same for non-interrupt cases.
Currently, there are three use cases for pci_requester_id():
- PCIX status bits: here we need BDF only, not requester ID. Replacing
with pci_get_bdf().
- PCIe Error injection and MSI delivery: for both these cases, we are
looking for requester IDs. Here we should use the new impl.
To avoid a PCI walk every time we send an MSI message, a requester_id
cache field is added to PCIDevice to cache the result when the PCI device
is initialized.
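The difference between a plain BDF and a cached requester ID can be sketched as below. This is a hedged illustration: `demo_build_bdf()`, `struct demo_pci_dev` and `demo_pci_dev_init()` are hypothetical, and the bridge walk that decides the alias is deliberately left out (it is passed in as a parameter). Only the BDF bit layout (8-bit bus, 5-bit device, 3-bit function) is standard PCI.

```c
#include <stddef.h>
#include <stdint.h>

/* A BDF packs bus (8 bits), device (5 bits) and function (3 bits). */
static uint16_t demo_build_bdf(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x07));
}

/* Hypothetical device record with a requester ID resolved once at init. */
struct demo_pci_dev {
    uint16_t bdf;
    uint16_t requester_id_cache;
};

/*
 * Resolve the requester ID once. `alias` stands in for the result of
 * walking the bridge hierarchy; pass NULL when no bridge on the path
 * rewrites the ID, in which case the device's own BDF is used.
 */
static void demo_pci_dev_init(struct demo_pci_dev *d, uint16_t bdf,
                              const uint16_t *alias)
{
    d->bdf = bdf;
    d->requester_id_cache = alias ? *alias : bdf;
}
```

With the value cached at init time, the MSI path only has to read one field instead of repeating the walk.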
Peter Maydell [Tue, 14 Jun 2016 15:04:25 +0000 (16:04 +0100)]
Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20160614-2' into staging
target-arm queue:
* add PMU support for virt machine under KVM
* fix reset and migration of TTBCR(S)
* add virt-2.7 machine type
* QOMify various ARM devices
* implement xilinx DisplayPort device
* don't permit ARMv8-only Neon insns to work on ARMv7
Peter Maydell [Tue, 14 Jun 2016 14:59:15 +0000 (15:59 +0100)]
target-arm: Don't permit ARMv8-only Neon insns on ARMv7
The Neon instructions VCVTA, VCVTM, VCVTN, VCVTP, VRINTA, VRINTM,
VRINTN, VRINTP, VRINTX, and VRINTZ were only introduced with ARMv8,
so they need a guard to make them UNDEF if the CPU only supports ARMv7.
(We got this right for all the other new-in-v8 insns, but forgot
it for these Neon 2-reg-misc ops.)
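The shape of such a guard might look like the sketch below. It is not the translator's real decode code: the enum, `demo_op_needs_v8()` and `demo_neon_should_undef()` are hypothetical, and only two of the listed instructions are modelled.

```c
#include <stdbool.h>

/* Hypothetical subset of Neon 2-reg-misc operations. */
enum demo_neon_op {
    DEMO_OP_VABS,      /* available before ARMv8 */
    DEMO_OP_VCVTA,     /* introduced with ARMv8 */
    DEMO_OP_VRINTX     /* introduced with ARMv8 */
};

static bool demo_op_needs_v8(enum demo_neon_op op)
{
    switch (op) {
    case DEMO_OP_VCVTA:
    case DEMO_OP_VRINTX:
        return true;
    default:
        return false;
    }
}

/* Returns true when the decoder should treat the insn as UNDEF. */
static bool demo_neon_should_undef(enum demo_neon_op op, bool cpu_has_v8)
{
    return demo_op_needs_v8(op) && !cpu_has_v8;
}
```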
Most of the control flow logic between send and recv (error checking
etc) is the same. Factor this out into a common send_recv() API.
This is then usable by clients, where the control logic for send
and receive differs only by a boolean. E.g.
Create two variants of DEFINE_VIRT_MACHINE. One, just called
DEFINE_VIRT_MACHINE, that does not set properties that only
the latest machine type should have, and another that does.
This will hopefully reduce potential for errors when adding
new versions.
Andrew Jones [Tue, 14 Jun 2016 14:59:12 +0000 (15:59 +0100)]
hw/arm/virt: separate versioned type-init code
Rename machvirt_info (which is specifically for 2.6 TypeInfo)
to machvirt_2_6_info, and separate the type registration of the
abstract machine type from the versioned type.
Peter Maydell [Tue, 14 Jun 2016 14:59:12 +0000 (15:59 +0100)]
target-arm: Fix reset and migration of TTBCR(S)
Commit 6459b94c26dd666badb3 broke reset and migration of the AArch32
TTBCR(S) register if the guest used non-LPAE page tables. This is
because the AArch32 TTBCR register definition is marked as ARM_CP_ALIAS,
meaning that the AArch64 variant has to handle migration and reset.
Although AArch64 TCR_EL3 doesn't need to care about the mask and
base_mask fields, AArch32 may do so, and so we must use the special
TTBCR reset and raw write functions to ensure they are set correctly.
This doesn't affect TCR_EL2, because the AArch32 equivalent of that
is HTCR, which never uses the non-LPAE page table variant.
Peter Maydell [Tue, 10 May 2016 10:30:42 +0000 (11:30 +0100)]
qdev_try_create(): Assert that devices we put onto the system bus are SysBusDevices
If qdev_try_create() is passed NULL for the bus, it will automatically
put the newly created device onto the default system bus. However
if the device is not actually a SysBusDevice then this will result
in later crashes (for instance when running the monitor "info qtree"
command) because code reasonably assumes that all devices on the system
bus are system bus devices.
Generally the mistake is that the calling code should create the
object with object_new(TYPE_FOO) rather than qdev_create(NULL, TYPE_FOO);
see commit 6749695eaaf346c1 for an example of fixing this bug.
Assert in qdev_try_create() if the device isn't suitable to put on
the system bus, so that this mistake results in failure earlier
and more reliably.
s390x/kvm: Fixup interrupt type for non-adapter I/O interrupts
The current algorithm for I/O interrupts would result in a wrong
interrupt type for subchannel numbers fffe and ffff. In addition
a non-adapter interrupt might look like an adapter interrupt for
any subchannel number that has the 0x0400 bit set.
No kernel has ever used the type outside logging - and the logging
was wrong all the time. For everything else the kernel used the
interrupt parameters.
Let's use the KVM_S390_INT_IO macro as for adapter interrupts.
The sclp scp read info call fills in a buffer with information about the
system. With more than 248 CPUs we overflow the 4k buffer of the SCCB,
leading to random data corruption. Basically ALL guest operating systems
call scp read info, so let's limit the machines to 248 CPUs to make it
obvious that >=249 does not work.
As KVM also limits itself to 248 and TCG on s390 does not support
SMP, this should cause no regression for any user as no VMs with more
than 248 VCPUs were ever possible.
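The limit can be checked with a little arithmetic. In the sketch below, the 4096-byte buffer comes from the text; the 128-byte fixed header and the 16-byte per-CPU entry are assumptions chosen to be consistent with the quoted limit of 248 and should be verified against the real structure definitions. All names are hypothetical.

```c
#include <stddef.h>

enum {
    DEMO_SCCB_SIZE      = 4096, /* the 4k buffer mentioned above */
    DEMO_READ_INFO_HDR  = 128,  /* assumed fixed part of the response */
    DEMO_CPU_ENTRY_SIZE = 16    /* assumed size of one CPU entry */
};

/* Largest CPU count whose entries still fit behind the header. */
static size_t demo_max_cpus(void)
{
    return (DEMO_SCCB_SIZE - DEMO_READ_INFO_HDR) / DEMO_CPU_ENTRY_SIZE;
}

/* Bytes needed for a response describing `cpus` CPUs. */
static size_t demo_read_info_len(size_t cpus)
{
    return DEMO_READ_INFO_HDR + cpus * DEMO_CPU_ENTRY_SIZE;
}
```

Under these assumptions 248 CPUs fill the buffer exactly, and the 249th entry would be written past its end.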
Let's introduce a CssDevId to handle device ids of the xx.x.xxxx
type used for channel devices. This has some benefits:
- We can use them in virtio-ccw and split the validity checks for
a channel device id in general from the constraint checking
within the virtio-ccw scope.
- We can reuse the device id type for future non-virtio channel
devices.
While we're at it, improve the validity checks and disallow e.g.
trailing characters.
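A strict parser for the xx.x.xxxx form could look like the sketch below. It is an illustration only: `struct demo_css_dev_id` and `demo_parse_css_dev_id()` are hypothetical and do not reflect the actual CssDevId layout, and range constraints on the individual fields (which belong to the device-specific checks) are left out.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container; the real CssDevId layout may differ. */
struct demo_css_dev_id {
    uint8_t cssid;
    uint8_t ssid;
    uint16_t devid;
};

/* Value of one hex digit; the caller guarantees isxdigit(c). */
static unsigned demo_hexval(char c)
{
    if (c >= '0' && c <= '9') {
        return (unsigned)(c - '0');
    }
    return (unsigned)(tolower((unsigned char)c) - 'a' + 10);
}

/* Parse exactly "xx.x.xxxx" (hex digits); anything else is rejected. */
static bool demo_parse_css_dev_id(const char *s, struct demo_css_dev_id *out)
{
    static const char pattern[] = "xx.x.xxxx";
    unsigned field[3] = { 0, 0, 0 };
    size_t i, f = 0;

    if (strlen(s) != sizeof(pattern) - 1) {
        return false;               /* too short, or trailing characters */
    }
    for (i = 0; i < sizeof(pattern) - 1; i++) {
        if (pattern[i] == '.') {
            if (s[i] != '.') {
                return false;
            }
            f++;
        } else {
            if (!isxdigit((unsigned char)s[i])) {
                return false;
            }
            field[f] = field[f] * 16 + demo_hexval(s[i]);
        }
    }
    out->cssid = (uint8_t)field[0];
    out->ssid = (uint8_t)field[1];
    out->devid = (uint16_t)field[2];
    return true;
}
```

Checking the total length up front is what rejects trailing characters; a scanf-style parser would silently accept them.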
Halil Pasic [Wed, 27 Jan 2016 12:24:17 +0000 (13:24 +0100)]
s390x/css: clear IO irqs when generating IPI CRW
According to the Principles of Operation (more precisely the subsection
'Channel-Report Word'), a subchannel put into the installed parameters
initialized state is in the same state as after an I/O system reset (just
parameters possibly changed). This implies that any I/O interrupts for that
subchannel are no longer pending (as I/O system resets clear I/O
interrupts). Therefore, we need an interface to clear pending I/O
interrupts. Make css_generate_sch_crws clear the pending IO interrupts for
the subchannel.
Peter Maydell [Tue, 14 Jun 2016 08:30:05 +0000 (09:30 +0100)]
Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-2.7-20160614' into staging
ppc patch queue for 2016-06-14
Latest patch queue for ppc.
* Allow qemu to support a generic architecture 2.07 (POWER8-era)
compatibility mode. This is useful for guests which are POWER8
aware, but don't know about the specific POWER8 variant that
qemu (and/or KVM) is emulating. (Thomas Huth)
* Fix a bug where macio wasn't removing DMA mappings (Mark Cave-Ayland)
* Add a workaround for Linux guest's miscalculation of maximum
memory address (including hotplugged memory), which could break
when hotplug memory was combined with VFIO. The previous
approach was technically correct by spec, but differed from
PowerVM's behaviour enough to trip a guest kernel bug. This
works around the bug, while remaining correct-to-spec. (Bharata Rao)
# gpg: Signature made Tue 14 Jun 2016 06:53:58 BST
# gpg: using RSA key 0x6C38CACA20D9B392
# gpg: Good signature from "David Gibson <[email protected]>"
# gpg: aka "David Gibson (Red Hat) <[email protected]>"
# gpg: aka "David Gibson (ozlabs.org) <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E 87DC 6C38 CACA 20D9 B392
* remotes/dgibson/tags/ppc-for-2.7-20160614:
spapr: Ensure all LMBs are represented in ibm,dynamic-memory
macio: call dma_memory_unmap() at the end of each DMA transfer
Add PowerPC AT_HWCAP2 definitions
ppc: Add PowerISA 2.07 compatibility mode
ppc: Improve PCR bit selection in ppc_set_compat()
ppc: Provide function to get CPU class of the host CPU
ppc: Split pcr_mask settings into supported bits and the register mask
ppc/spapr: Refactor h_client_architecture_support() CPU parsing code
Bharata B Rao [Fri, 10 Jun 2016 05:14:48 +0000 (10:44 +0530)]
spapr: Ensure all LMBs are represented in ibm,dynamic-memory
Memory hotplug can fail for some combinations of RAM and maxmem when
DDW is enabled in the presence of devices like nec-usb-xhci. DDW depends
on maximum addressable memory returned by guest and this value is currently
being calculated wrongly by the guest kernel routine memory_hotplug_max().
While there is an attempt to fix the guest kernel, this patch works
around the problem within QEMU itself.
The memory_hotplug_max() routine in the guest kernel arrives at the max
addressable memory by multiplying lmb-size with the lmb-count obtained
from the ibm,dynamic-memory property. There are two assumptions here:
- All LMBs are part of ibm,dynamic-memory: This is not true for PowerKVM,
where only hot-pluggable LMBs are present in this property.
- The memory area comprising RAM and the hotplug region is contiguous: This
needn't always be true for PowerKVM, as there can be a gap between
boot-time RAM and the hotplug region.
To work around this guest kernel bug, ensure that ibm,dynamic-memory
has information about all the LMBs (RMA, boot-time LMBs, future
hotpluggable LMBs, and dummy LMBs to cover the gap between RAM and
hotpluggable region).
RMA is represented separately by the memory@0 node. Hence mark the RMA LMBs
and also the LMBs for the gap between RAM and the hotpluggable region as
reserved and as having no valid DRC, so that these LMBs are not considered
by the guest.
Thomas Huth [Tue, 7 Jun 2016 15:39:39 +0000 (17:39 +0200)]
ppc: Improve PCR bit selection in ppc_set_compat()
When using an older PowerISA level, all the upper compatibility
bits have to be enabled, too. For example when we want to run
something in PowerISA 2.05 compatibility mode on POWER8, the bit
for 2.06 has to be set in addition to the bit for 2.05.
Additionally, to make sure that we do not set bits that are not
supported by the host, we apply a mask with the known-to-be-good
bits here, too.
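The selection rule can be sketched as below. The bit positions, the enum and `demo_pcr_for_level()` are illustrative assumptions rather than the real PCR layout or the real ppc_set_compat() code; the sketch only shows "set the requested level and everything above it, then mask with what the host supports".

```c
#include <stdint.h>

/* Illustrative bit assignments; the real PCR layout is defined elsewhere. */
#define DEMO_PCR_COMPAT_2_05 (1u << 0)
#define DEMO_PCR_COMPAT_2_06 (1u << 1)
#define DEMO_PCR_COMPAT_2_07 (1u << 2)

enum demo_isa_level { DEMO_ISA_2_05, DEMO_ISA_2_06, DEMO_ISA_2_07 };

/*
 * Select the bit for the requested level plus the bits of every newer
 * level, then drop anything the host does not support.
 */
static uint32_t demo_pcr_for_level(enum demo_isa_level level,
                                   uint32_t host_mask)
{
    uint32_t bits = 0;

    switch (level) {
    case DEMO_ISA_2_05:
        bits |= DEMO_PCR_COMPAT_2_05;
        /* fall through */
    case DEMO_ISA_2_06:
        bits |= DEMO_PCR_COMPAT_2_06;
        /* fall through */
    case DEMO_ISA_2_07:
        bits |= DEMO_PCR_COMPAT_2_07;
        break;
    }
    return bits & host_mask;
}
```

With a host mask that only knows the 2.05 and 2.06 bits (as assumed here for a POWER8-like host), requesting 2.05 yields both bits, matching the example in the text.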
Signed-off-by: Thomas Huth <[email protected]>
[dwg: Added some #ifs to fix compile on 32-bit targets]
Signed-off-by: David Gibson <[email protected]>
Thomas Huth [Tue, 7 Jun 2016 15:39:37 +0000 (17:39 +0200)]
ppc: Split pcr_mask settings into supported bits and the register mask
The current pcr_mask values are ambiguous: Should these be the mask
that defines valid bits in the PCR register? Or should these rather
indicate which compatibility levels are possible? Anyway, POWER6 and
POWER7 should certainly not use the same values here. So let's
introduce an additional variable "pcr_supported" here which is
used to indicate the valid compatibility levels, and use pcr_mask
to signal the valid bits in the PCR register.
Thomas Huth [Tue, 7 Jun 2016 15:39:36 +0000 (17:39 +0200)]
ppc/spapr: Refactor h_client_architecture_support() CPU parsing code
The h_client_architecture_support() function has become quite big
and nested already. So factor out the code that takes care of the
sPAPR compatibility PVRs (which will be modified by the following
patches).
* remotes/kraxel/tags/pull-usb-20160613-1:
vl: Eliminate usb_enabled()
pxa2xx: Unconditionally enable USB controller
hw/usb/dev-network.c: Use ldl_le_p() and stl_le_p()
usb-host: add special case for bus+addr
Jan Beulich [Mon, 23 May 2016 06:44:57 +0000 (00:44 -0600)]
xen/blkif: avoid double access to any shared ring request fields
Commit f9e98e5d7a ("xen/blkif: Avoid double access to
src->nr_segments") didn't go far enough: src->operation is also being
used twice. And nothing was done to prevent the compiler from using the
source side of the copy done by blk_get_request() (granted that's very
unlikely).
Move the barrier()s up, and add another one to blk_get_request().
Note that for completing XSA-155, the barrier() getting added to
blk_get_request() would suffice, and hence the changes to xen_blkif.h
are more like just cleanup. And since, as said, the unpatched code
getting compiled to something vulnerable is very unlikely (and not
observed in practice), this isn't being viewed as a new security issue.
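The pattern being enforced is "read each shared field once, then only use the private copy". A minimal sketch, assuming GCC/Clang inline-asm syntax for the compiler barrier; `struct demo_request`, `DEMO_MAX_SEGMENTS` and `demo_get_request()` are hypothetical and much simpler than the real blkif structures.

```c
#include <stdint.h>

/* Compiler barrier (GCC/Clang syntax): no memory access may be moved across it. */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

#define DEMO_MAX_SEGMENTS 11   /* illustrative limit */

/* Hypothetical request layout living in memory shared with the guest. */
struct demo_request {
    uint8_t operation;
    uint8_t nr_segments;
};

/*
 * Copy the request out of shared memory once, then validate and use
 * only the private copy. Returns 0 on success, -1 if the copy is invalid.
 */
static int demo_get_request(struct demo_request *dst,
                            const volatile struct demo_request *src)
{
    dst->operation = src->operation;
    dst->nr_segments = src->nr_segments;
    demo_barrier();   /* later code must not go back to *src */

    if (dst->nr_segments > DEMO_MAX_SEGMENTS) {
        return -1;
    }
    return 0;
}
```

Because validation runs on the copy, a guest that changes the shared field after the check cannot influence what the backend later uses.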
Peter Maydell [Mon, 13 Jun 2016 12:05:02 +0000 (13:05 +0100)]
Merge remote-tracking branch 'remotes/berrange/tags/qcrypto-next-2016-06-13-v1' into staging
Merge qcrypto-next 2016/06/13 v1
# gpg: Signature made Mon 13 Jun 2016 12:43:22 BST
# gpg: using RSA key 0xBE86EBB415104FDF
# gpg: Good signature from "Daniel P. Berrange <[email protected]>"
# gpg: aka "Daniel P. Berrange <[email protected]>"
# Primary key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF
* remotes/berrange/tags/qcrypto-next-2016-06-13-v1:
crypto: aes: always rename internal symbols
crypto: assert that qcrypto_hash_digest_len is in range
crypto: remove temp files on completion of secrets test
TLS: provide slightly more information when TLS certificate loading fails
Mike Frysinger [Mon, 6 Jun 2016 22:05:35 +0000 (18:05 -0400)]
crypto: aes: always rename internal symbols
OpenSSL's libcrypto always defines AES symbols with the same names as
qemu's local aes code. This is problematic when enabling at least curl
as that frequently also uses libcrypto. It might not be noticed when
running, but if you try to statically link, everything falls down.
An example snippet:
LINK qemu-nbd
.../libcrypto.a(aes-x86_64.o): In function 'AES_encrypt':
(.text+0x460): multiple definition of 'AES_encrypt'
crypto/aes.o:aes.c:(.text+0x670): first defined here
.../libcrypto.a(aes-x86_64.o): In function 'AES_decrypt':
(.text+0x9f0): multiple definition of 'AES_decrypt'
crypto/aes.o:aes.c:(.text+0xb30): first defined here
.../libcrypto.a(aes-x86_64.o): In function 'AES_cbc_encrypt':
(.text+0xf90): multiple definition of 'AES_cbc_encrypt'
crypto/aes.o:aes.c:(.text+0xff0): first defined here
collect2: error: ld returned 1 exit status
.../qemu-2.6.0/rules.mak:105: recipe for target 'qemu-nbd' failed
make: *** [qemu-nbd] Error 1
The aes.h header has redefines already for FreeBSD, but go ahead and
enable that for everyone since there's no real good reason to not use
a namespace all the time.
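The renaming technique is plain preprocessor aliasing, sketched below. The `DEMO_` prefix and the placeholder function bodies are assumptions made for illustration; they are not the names or code in aes.h.

```c
/* Rename before any declaration or definition sees the generic names. */
#define AES_encrypt DEMO_AES_encrypt
#define AES_decrypt DEMO_AES_decrypt

/* After preprocessing, these declare DEMO_AES_encrypt and DEMO_AES_decrypt. */
int AES_encrypt(int x);
int AES_decrypt(int x);

/* Placeholder bodies; only the resulting symbol names matter here. */
int AES_encrypt(int x) { return x + 1; }
int AES_decrypt(int x) { return x - 1; }
```

Callers keep writing `AES_encrypt(...)`, but the object file exports only the prefixed symbols, so a static link against another library that defines the unprefixed names no longer collides.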
Paolo Bonzini [Fri, 20 May 2016 09:09:54 +0000 (11:09 +0200)]
crypto: assert that qcrypto_hash_digest_len is in range
Otherwise unintended results could happen. For example,
Coverity reports a division by zero in qcrypto_afsplit_hash.
While this cannot really happen, it shows that the contract
of qcrypto_hash_digest_len can be improved.
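A tightened contract of this kind can be sketched as follows. The enum, the table and `demo_hash_digest_len()` are hypothetical stand-ins, not the qcrypto code; the digest sizes are the standard ones for MD5, SHA-1 and SHA-256.

```c
#include <assert.h>
#include <stddef.h>

enum demo_hash_alg {
    DEMO_HASH_MD5,
    DEMO_HASH_SHA1,
    DEMO_HASH_SHA256,
    DEMO_HASH__MAX
};

static const size_t demo_digest_len[DEMO_HASH__MAX] = {
    [DEMO_HASH_MD5] = 16,
    [DEMO_HASH_SHA1] = 20,
    [DEMO_HASH_SHA256] = 32,
};

/* Never returns 0: an out-of-range algorithm trips the assertion instead. */
static size_t demo_hash_digest_len(enum demo_hash_alg alg)
{
    assert((unsigned)alg < DEMO_HASH__MAX);
    return demo_digest_len[alg];
}
```

With the assertion in place, callers that divide by the returned length can rely on it being nonzero.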
crypto: remove temp files on completion of secrets test
The secret object tests left some temporary files on disk
when completing. Ensure they are unlinked, and rename them
to make it more obvious where they come from.
Alex Bligh [Tue, 5 Apr 2016 19:33:48 +0000 (20:33 +0100)]
TLS: provide slightly more information when TLS certificate loading fails
Give slightly more information when certificate loading fails.
Rather than have no information, you now get gnutls's only slightly
less unhelpful error messages.
Eduardo Habkost [Wed, 8 Jun 2016 20:50:24 +0000 (17:50 -0300)]
pxa2xx: Unconditionally enable USB controller
Simplify initialization logic by removing the usb_enabled()
check. The USB controller is part of the SoC, so it doesn't make
sense to create a system where it is not present.
Peter Maydell [Fri, 10 Jun 2016 15:37:57 +0000 (16:37 +0100)]
hw/usb/dev-network.c: Use ldl_le_p() and stl_le_p()
Use stl_le_p() and ldl_le_p() to read and write data from
buffers, rather than using pointer casts and cpu_to_le32()
for writes and le32_to_cpup() for reads. This:
* avoids lots of casts
* works even if the buffer isn't as aligned as the host would like
* avoids using the *_to_cpup() functions which we want to get rid of
Note that there may still be some places where a pointer from the
guest is cast to a pointer to a host structure; these would also
have to be changed for the device to work on a host CPU which
enforces alignment restrictions.
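Why byte-wise accessors avoid both the casts and the alignment problem can be seen in the sketch below. `demo_ldl_le()` and `demo_stl_le()` are hypothetical helpers written for illustration; they are not QEMU's implementation of ldl_le_p() and stl_le_p().

```c
#include <stdint.h>
#include <string.h>

/* Load a 32-bit little-endian value from a possibly unaligned buffer. */
static uint32_t demo_ldl_le(const void *ptr)
{
    uint8_t b[4];

    memcpy(b, ptr, sizeof(b));
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Store a 32-bit value as little-endian into a possibly unaligned buffer. */
static void demo_stl_le(void *ptr, uint32_t v)
{
    uint8_t b[4] = {
        (uint8_t)v, (uint8_t)(v >> 8), (uint8_t)(v >> 16), (uint8_t)(v >> 24)
    };

    memcpy(ptr, b, sizeof(b));
}
```

Since the value is assembled from individual bytes, the result is the same on big- and little-endian hosts and no `uint32_t *` is ever dereferenced at an odd address.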
Gerd Hoffmann [Fri, 3 Jun 2016 09:12:55 +0000 (11:12 +0200)]
usb-host: add special case for bus+addr
This patch changes usb-host behavior in case the hostbus= and hostaddr=
properties are used to identify the usb device in question. Instead of
adding the device to the hotplug watchlist we try to open it directly using
the given bus number and device address.
Putting a device specified by hostaddr to the hotplug watchlist isn't
a great idea as the address isn't a fixed property. It changes every
time the device is plugged in. So considering this case as "use the
device at bus:addr _now_" is more sane. Also usb-host will throw errors
in case it can't initialize the host device.
Note: For devices on the hotplug watchlist (hostport or vendorid or
productid specified) qemu continues to ignore errors and keeps
monitoring the usb bus to see if the device eventually shows up.
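The resulting decision can be summarised in a small sketch. The struct, the enum and `demo_usb_host_action()` are hypothetical, and treating 0 or NULL as "property not set" is an assumption made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical match criteria; 0 / NULL means "not specified". */
struct demo_usb_match {
    uint32_t bus_num;
    uint32_t addr;
    const char *port;
    uint32_t vendor_id;
    uint32_t product_id;
};

enum demo_usb_action {
    DEMO_USB_OPEN_NOW,       /* open bus:addr directly, report errors */
    DEMO_USB_WATCH_HOTPLUG   /* keep monitoring, ignore errors */
};

static enum demo_usb_action demo_usb_host_action(const struct demo_usb_match *m)
{
    if (m->bus_num != 0 && m->addr != 0) {
        return DEMO_USB_OPEN_NOW;
    }
    return DEMO_USB_WATCH_HOTPLUG;
}
```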
Anthony PERARD [Thu, 9 Jun 2016 15:56:17 +0000 (16:56 +0100)]
exec: Fix qemu_ram_block_from_host for Xen
Since f615f39 (exec: remove ram_addr argument from
qemu_ram_block_from_host), migration under Xen is likely to fail, with a
SEGV of QEMU. But that commit only reveals a bug in the calculation of
the offset value in qemu_ram_block_from_host().
This patch calculates the offset from the ram_addr as
qemu_ram_addr_from_host() will later calculate the ram_addr from the
offset.
Peter Maydell [Mon, 13 Jun 2016 09:12:44 +0000 (10:12 +0100)]
Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20160611' into staging
TB hashing improvements
# gpg: Signature made Sun 12 Jun 2016 01:12:50 BST
# gpg: using RSA key 0xAD1270CC4DD0279B
# gpg: Good signature from "Richard Henderson <[email protected]>"
# gpg: aka "Richard Henderson <[email protected]>"
# gpg: aka "Richard Henderson <[email protected]>"
# Primary key fingerprint: 9CB1 8DDA F8E8 49AD 2AFC 16A4 AD12 70CC 4DD0 279B
* remotes/rth/tags/pull-tcg-20160611:
translate-all: add tb hash bucket info to 'info jit' dump
tb hash: track translated blocks with qht
qht: add test-qht-par to invoke qht-bench from 'check' target
qht: add qht-bench, a performance benchmark
qht: add test program
qht: QEMU's fast, resizable and scalable Hash Table
qdist: add test program
qdist: add module to represent frequency distributions of data
tb hash: hash phys_pc, pc, and flags with xxhash
exec: add tb_hash_func5, derived from xxhash
qemu-thread: add simple test-and-set spinlock
include/processor.h: define cpu_relax()
seqlock: rename write_lock/unlock to write_begin/end
seqlock: remove optional mutex
compiler.h: add QEMU_ALIGNED() to enforce struct alignment
Emilio G. Cota [Wed, 8 Jun 2016 18:55:32 +0000 (14:55 -0400)]
tb hash: track translated blocks with qht
Having a fixed-size hash table for keeping track of all translation blocks
is suboptimal: some workloads are just too big or too small to get maximum
performance from the hash table. The MRU promotion policy helps improve
performance when the hash table is a little undersized, but it cannot
make up for severely undersized hash tables.
Furthermore, frequent MRU promotions result in writes that are a scalability
bottleneck. For scalability, lookups should only perform reads, not writes.
This is not a big deal for now, but it will become one once MTTCG matures.
The appended patch fixes these issues by using qht as the implementation of
the TB hash table. This solution is superior to other alternatives considered,
namely:
- master: implementation in QEMU before this patchset
- xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
- xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
MRU is implemented here by adding an intermediate struct
that contains the u32 hash and a pointer to the TB; this
allows us, on an MRU promotion, to copy said struct (that is not
at the head), and put this new copy at the head. After a grace
period, the original non-head struct can be eliminated, and
after another grace period, freed.
- qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
no MRU for lookups; MRU for inserts.
The appended solution is the following:
- qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
no MRU for lookups; MRU for inserts.
The plots below compare the considered solutions. The Y axis shows the
boot time (in seconds) of a debian jessie image with arm-softmmu; the X axis
sweeps the number of buckets (or initial number of buckets for qht-autoresize).
The plots in PNG format (and with errorbars) can be seen here:
http://imgur.com/a/Awgnq
Each test runs 5 times, and the entire QEMU process is pinned to a
single core for repeatability of results.
Note that the original point before this patch series is X=15 for "master";
the little sensitivity to the increased number of buckets is due to the
poor hashing function in master.
xxhash-rcu has significant overhead due to the constant churn of allocating
and deallocating intermediate structs for implementing MRU. An alternative
would be to consider failed lookups as "maybe not there", and then
acquire the external lock (tb_lock in this case) to really confirm that
there was indeed a failed lookup. This, however, would not be enough
to implement dynamic resizing--this is more complex: see
"Resizable, Scalable, Concurrent Hash Tables via Relativistic
Programming" by Triplett, McKenney and Walpole. This solution was
discarded due to the very coarse RCU read critical sections that we have
in MTTCG; resizing requires waiting for readers after every pointer update,
and resizes require many pointer updates, so this would quickly become
prohibitive.
qht-fixed-nomru shows that MRU promotion is advisable for undersized
hash tables.
However, qht-dyn-mru shows that MRU promotion is not important if the
hash table is properly sized: there is virtually no difference in
performance between qht-dyn-nomru and qht-dyn-mru.
Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
X=15 @ qht-dyn-nomru. This patch thus matches the best performance that we
can achieve with optimum sizing of the hash table, while keeping the hash
table scalable for readers.
The improvement we get before and after this patch for booting debian jessie
with arm-softmmu is:
- Intel Xeon E5-2690: 10.5% less time
- Intel i7-4790K: 5.2% less time
We could get this same improvement _for this particular workload_ by
statically increasing the size of the hash table. But this would hurt
workloads that do not need a large hash table. The dynamic (upward)
resizing allows us to start small and enlarge the hash table as needed.
A quick note on downsizing: the table is resized back to 2**15 buckets
on every tb_flush; this makes sense because it is not guaranteed that the
table will reach the same number of TBs later on (e.g. most bootup code is
thrown away after boot); it makes sense to grow the hash table as
more code blocks are translated. This also avoids the complication of
having to build downsizing hysteresis logic into qht.
Emilio G. Cota [Wed, 8 Jun 2016 18:55:30 +0000 (14:55 -0400)]
qht: add qht-bench, a performance benchmark
This serves as a performance benchmark as well as a stress test
for QHT. We can tweak quite a number of things, including the
number of resize threads and how frequently resizes are triggered.
A performance comparison of QHT vs CLHT[1] and ck_hs[2] using
this same benchmark program can be found here:
http://imgur.com/a/0Bms4
The tests are run on a 64-core AMD Opteron 6376, pinning threads
to cores favoring same-socket cores. For each run, qht-bench is
invoked with:
$ tests/qht-bench -d $duration -n $n -u $u -g $range
, where $duration is in seconds, $n is the number of threads,
$u is the update rate (0.0 to 100.0), and $range is the number
of keys.
Note that ck_hs's performance drops significantly as writes go
up, since it requires an external lock (I used a ck_spinlock)
around every write.
Also, note that CLHT, instead of using a seqlock, relies on an
allocator that does not ever return the same address during the
same read-critical section. This gives it a slight performance
advantage over QHT on read-heavy workloads, since the seqlock
writes aren't there.
Emilio G. Cota [Wed, 8 Jun 2016 18:55:28 +0000 (14:55 -0400)]
qht: QEMU's fast, resizable and scalable Hash Table
This is a fast, scalable chained hash table with optional auto-resizing, allowing
reads that are concurrent with reads, and reads/writes that are concurrent
with writes to separate buckets.
A hash table with these features will be necessary for the scalability
of the ongoing MTTCG work; before those changes arrive we can already
benefit from the single-threaded speedup that qht also provides.
Emilio G. Cota [Wed, 8 Jun 2016 18:55:26 +0000 (14:55 -0400)]
qdist: add module to represent frequency distributions of data
Sometimes it is useful to have a quick histogram to represent a certain
distribution -- for example, when investigating a performance regression
in a hash table due to inadequate hashing.
The appended patch allows us to easily represent a distribution using Unicode
characters. Further, the data structure keeping track of the distribution
is so simple that obtaining its values for off-line processing is trivial.
Emilio G. Cota [Wed, 8 Jun 2016 18:55:25 +0000 (14:55 -0400)]
tb hash: hash phys_pc, pc, and flags with xxhash
For some workloads such as arm bootup, tb_phys_hash is performance-critical.
This is due to the high frequency of accesses to the hash table, originated
by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
More info:
https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
To dig further into this I modified an arm image booting debian jessie to
immediately shut down after boot. Analysis revealed that quite a bit of time
is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
results in very uneven loading of chains in the hash table's buckets;
the longest observed chain had ~550 elements.
The appended patch addresses this with two changes:
1) Use xxhash as the hash table's hash function. xxhash is a fast,
high-quality hashing function.
2) Feed the hashing function with not just tb_phys, but also pc and flags.
This improves performance over using just tb_phys for hashing, since that
resulted in some hash buckets having many TBs while others got very few;
with these changes, the longest observed chain on a single hash bucket is
brought down from ~550 to ~40.
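The effect of feeding pc and flags into the hash can be illustrated with the sketch below. Note the substitution: it uses 32-bit FNV-1a as a simple stand-in and is not xxhash and not tb_hash_func5; all `demo_*` names are hypothetical.

```c
#include <stdint.h>

/* FNV-1a steps over the low `nbytes` bytes of `v`, least significant first. */
static uint32_t demo_fnv1a_mix(uint32_t h, uint64_t v, unsigned nbytes)
{
    unsigned i;

    for (i = 0; i < nbytes; i++) {
        h ^= (uint8_t)(v >> (8 * i));
        h *= 16777619u;            /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Hash all three fields, so TBs sharing phys_pc can still be told apart. */
static uint32_t demo_tb_hash(uint64_t phys_pc, uint64_t pc, uint32_t flags)
{
    uint32_t h = 2166136261u;      /* FNV-1a 32-bit offset basis */

    h = demo_fnv1a_mix(h, phys_pc, 8);
    h = demo_fnv1a_mix(h, pc, 8);
    h = demo_fnv1a_mix(h, flags, 4);
    return h;
}

/* Reduce a hash to a bucket index for a table of 2**bits buckets (bits < 32). */
static uint32_t demo_tb_bucket(uint32_t hash, unsigned bits)
{
    return hash & ((1u << bits) - 1);
}
```

A hash over tb_phys alone would give every TB with the same physical address the same value; including pc and flags gives such TBs a chance to land in different buckets.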
Tests show that the other element checked for in tb_find_physical,
cs_base, is always a match when tb_phys+pc+flags are a match,
so hashing cs_base is wasteful. It could be that this is an ARM-only
thing, though. UPDATE:
On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
> The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
> consisting of only a delay slot).
> It may well still turn out to be reasonable to ignore cs_base for hashing.
BTW, after this change the hash table should not be called "tb_hash_phys"
anymore; this is addressed later in this series.
This change gives consistent bootup time improvements. I tested two
host machines:
- Intel Xeon E5-2690: 11.6% less time
- Intel i7-4790K: 19.2% less time
Increasing the number of hash buckets yields further improvements. However,
using a larger, fixed number of buckets can degrade performance for other
workloads that do not translate as many blocks (600K+ for debian-jessie arm
bootup). This is dealt with later in this series.
Reviewed-by: Sergey Fedorov <[email protected]>
Signed-off-by: Guillaume Delbergue <[email protected]>
[Rewritten. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>
[Emilio's additions: use TAS instead of atomic_xchg; emit acquire/release
barriers; return bool from trylock; call cpu_relax() while spinning;
optimize for uncontended locks by acquiring the lock with TAS instead
of TATAS; add qemu_spin_locked().]
Signed-off-by: Emilio G. Cota <[email protected]>
Message-Id: <1465412133[email protected]>
Signed-off-by: Richard Henderson <[email protected]>
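A test-and-set spinlock along the lines listed in the notes above can be sketched with C11 atomics. This is not the qemu-thread implementation: all `demo_*` names are hypothetical, `demo_cpu_relax()` is only a placeholder for a real pause hint, and the choice that trylock returns true on success is a convention picked for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct demo_spin {
    atomic_flag locked;
};

static void demo_spin_init(struct demo_spin *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_relaxed);
}

/* Placeholder for cpu_relax(); a real one would issue a CPU pause hint. */
static void demo_cpu_relax(void)
{
    atomic_signal_fence(memory_order_seq_cst);
}

/* Returns true when the lock was acquired (convention of this sketch). */
static bool demo_spin_trylock(struct demo_spin *s)
{
    return !atomic_flag_test_and_set_explicit(&s->locked,
                                              memory_order_acquire);
}

/* Plain TAS loop: the first attempt already tries to take the lock. */
static void demo_spin_lock(struct demo_spin *s)
{
    while (atomic_flag_test_and_set_explicit(&s->locked,
                                             memory_order_acquire)) {
        demo_cpu_relax();
    }
}

static void demo_spin_unlock(struct demo_spin *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

Going straight to test-and-set, rather than reading the flag first (TATAS), saves a load in the uncontended case, which is the case the notes say the lock is optimized for.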
* remotes/kraxel/tags/pull-ui-20160610-1:
console: ignore ui_info updates which don't actually update something
ui/console-gl: Add support for big endian display surfaces
gtk: fix vte version check
ui: fix regression in printing VNC host/port on startup
vnc: drop unused depth arg for set_pixel_format
Peter Maydell [Tue, 17 May 2016 14:18:07 +0000 (15:18 +0100)]
target-i386: Move user-mode exception actions out of user-exec.c
The exception_action() function in user-exec.c is just a call to
cpu_loop_exit() for every target CPU except i386. Since this
function is only called if the target's handle_mmu_fault() hook has
indicated an MMU fault, and that hook is only called from the
handle_cpu_signal() code path, we can simply move the x86-specific
setup into that hook, which allows us to remove the TARGET_I386
ifdef from user-exec.c.
Of the actions that were done by the call to raise_interrupt_err():
* cpu_svm_check_intercept_param() is a no-op in user mode
* check_exception() is a no-op since double faults are impossible
for user-mode
* assignments to cs->exception_index and env->error_code are no-ops
* assigning to env->exception_next_eip is unnecessary because it
is not used unless env->exception_is_int is true
* cpu_loop_exit_restore() is equivalent to cpu_loop_exit() since
pc is 0
which leaves just setting env->exception_is_int as the action that
needs to be added to x86_cpu_handle_mmu_fault().
Peter Maydell [Tue, 17 May 2016 14:18:06 +0000 (15:18 +0100)]
target-i386: Add comment about do_interrupt_user() next_eip argument
Add a comment to do_interrupt_user() along the same lines as the
existing one for do_interrupt_all() noting that the next_eip
argument is not used unless is_int is true or intno is EXCP_SYSCALL.