# gpg: Signature made Fri 19 Oct 2018 14:42:35 BST
# gpg: using RSA key BE86EBB415104FDF
# gpg: Good signature from "Daniel P. Berrange <[email protected]>"
# gpg: aka "Daniel P. Berrange <[email protected]>"
# Primary key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF
* remotes/berrange/tags/qcrypto-next-pull-request:
crypto: require nettle >= 2.7.1 for building QEMU
crypto: require libgcrypt >= 1.5.0 for building QEMU
crypto: require gnutls >= 3.1.18 for building QEMU
In several places we use assert(FEATURE), and assume that if FEATURE
is disabled, all following code is removed as unreachable. Which allows
us to compile-out functions that are only present with FEATURE, and
have a link-time failure if the functions remain used.
MinGW does not mark its internal function _assert() as noreturn, so the
compiler cannot see when code is unreachable, which leads to link errors
for this host that are not present elsewhere.
The current build-time failure concerns 62823083b8a2, but I remember
having seen this same error before. Fix it once and for all for MinGW.
* remotes/vivier2/tags/linux-user-for-3.1-pull-request:
linux-user: Implement special usbfs ioctls.
linux-user: Define ordinary usbfs ioctls.
linux-user: Check for Linux USBFS in configure
* remotes/bonzini/tags/for-upstream: (47 commits)
replay: pass raw icount value to replay_save_clock
target/i386: kvm: just return after migrate_add_blocker failed
hyperv_testdev: add SynIC message and event testmodes
hyperv: process POST_MESSAGE hypercall
hyperv: add support for KVM_HYPERV_EVENTFD
hyperv: process SIGNAL_EVENT hypercall
hyperv: add synic event flag signaling
hyperv: add synic message delivery
hyperv: make overlay pages for SynIC
hyperv: only add SynIC in compatible configurations
hyperv: qom-ify SynIC
hyperv:synic: split capability testing and setting
i386: add hyperv-stub for CONFIG_HYPERV=n
default-configs: collect CONFIG_HYPERV* in hyperv.mak
hyperv: factor out arch-independent API into hw/hyperv
hyperv: make hyperv_vp_index inline
hyperv: split hyperv-proto.h into x86 and arch-independent parts
hyperv: rename kvm_hv_sint_route_set_sint
hyperv: make HvSintRoute reference-counted
hyperv: address HvSintRoute by X86CPU pointer
...
Peter Maydell [Fri, 19 Oct 2018 15:17:32 +0000 (16:17 +0100)]
Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20181018' into staging
Queued tcg patches.
# gpg: Signature made Fri 19 Oct 2018 07:03:20 BST
# gpg: using RSA key 64DF38E8AF7E215F
# gpg: Good signature from "Richard Henderson <[email protected]>"
# Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A 05C0 64DF 38E8 AF7E 215F
* remotes/rth/tags/pull-tcg-20181018: (21 commits)
cputlb: read CPUTLBEntry.addr_write atomically
target/s390x: Check HAVE_ATOMIC128 and HAVE_CMPXCHG128 at translate
target/s390x: Skip wout, cout helpers if op helper does not return
target/s390x: Split do_cdsg, do_lpq, do_stpq
target/s390x: Convert to HAVE_CMPXCHG128 and HAVE_ATOMIC128
target/ppc: Convert to HAVE_CMPXCHG128 and HAVE_ATOMIC128
target/arm: Check HAVE_CMPXCHG128 at translate time
target/arm: Convert to HAVE_CMPXCHG128
target/i386: Convert to HAVE_CMPXCHG128
tcg: Split CONFIG_ATOMIC128
tcg: Add tlb_index and tlb_entry helpers
cputlb: serialize tlb updates with env->tlb_lock
cputlb: fix assert_cpu_is_self macro
exec: introduce tlb_init
target/unicore32: remove tlb_flush from uc32_init_fn
target/alpha: remove tlb_flush from alpha_cpu_initfn
tcg: distribute tcg_time into TCG contexts
tcg: plug holes in struct TCGProfile
tcg: fix use of uninitialized variable under CONFIG_PROFILER
tcg: access cpu->icount_decr.u16.high with atomics
...
Peter Maydell [Fri, 19 Oct 2018 14:30:40 +0000 (15:30 +0100)]
Merge remote-tracking branch 'remotes/jasowang/tags/net-pull-request' into staging
# gpg: Signature made Fri 19 Oct 2018 04:16:03 BST
# gpg: using RSA key EF04965B398D6211
# gpg: Good signature from "Jason Wang (Jason Wang on RedHat) <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 215D 46F4 8246 689E C77F 3562 EF04 965B 398D 6211
* remotes/jasowang/tags/net-pull-request: (26 commits)
qemu-options: Fix bad "macaddr" property in the documentation
e1000: indicate dropped packets in HW counters
net: ignore packet size greater than INT_MAX
pcnet: fix possible buffer overflow
rtl8139: fix possible out of bound access
ne2000: fix possible out of bound access in ne2000_receive
clean up callback when del virtqueue
docs: Add COLO status diagram to COLO-FT.txt
COLO: quick failover process by kick COLO thread
COLO: notify net filters about checkpoint/failover event
filter-rewriter: handle checkpoint and failover event
filter: Add handle_event method for NetFilterClass
COLO: flush host dirty ram from cache
savevm: split the process of different stages for loadvm/savevm
qapi: Add new command to query colo status
qapi/migration.json: Rename COLO unknown mode to none mode.
qmp event: Add COLO_EXIT event to notify users while exited COLO
COLO: Flush memory data from ram cache
ram/COLO: Record the dirty pages that SVM received
COLO: Load dirty pages into SVM's RAM cache firstly
...
Cortland Tölva [Mon, 8 Oct 2018 16:35:21 +0000 (09:35 -0700)]
linux-user: Implement special usbfs ioctls.
Userspace submits a USB Request Buffer to the kernel, optionally
discards it, and finally reaps the URB. Thunk buffers from target
to host and back.
Tested by running an i386 scanner driver on ARMv7 and by running
the PowerPC lsusb utility on x86_64. The discardurb ioctl is
not exercised in these tests.
Roman Kagan [Fri, 21 Sep 2018 08:22:17 +0000 (11:22 +0300)]
hyperv_testdev: add SynIC message and event testmodes
Add testmodes for SynIC messages and events. The message or event
connection setup / teardown is initiated by the guest via new control
codes written to the test device port. Then the test connections bounce
the respective operations back to the guest, i.e. the incoming messages
are posted or the incoming events are signaled on the configured vCPUs.
Roman Kagan [Fri, 21 Sep 2018 08:22:16 +0000 (11:22 +0300)]
hyperv: process POST_MESSAGE hypercall
Add handling of POST_MESSAGE hypercall. For that, add an interface to
regsiter a handler for the messages arrived from the guest on a
particular connection id (IOW set up a message connection in Hyper-V
speak).
Roman Kagan [Fri, 21 Sep 2018 08:22:15 +0000 (11:22 +0300)]
hyperv: add support for KVM_HYPERV_EVENTFD
When setting up a notifier for Hyper-V event connection, try to use the
KVM-assisted one first, and fall back to userspace handling of the
hypercall if the kernel doesn't provide the requested feature.
Roman Kagan [Fri, 21 Sep 2018 08:22:14 +0000 (11:22 +0300)]
hyperv: process SIGNAL_EVENT hypercall
Add handling of SIGNAL_EVENT hypercall. For that, provide an interface
to associate an EventNotifier with an event connection number, so that
it's signaled when the SIGNAL_EVENT hypercall with the matching
connection ID is called by the guest.
Support for using KVM functionality for this will be added in a followup
patch.
Roman Kagan [Fri, 21 Sep 2018 08:22:12 +0000 (11:22 +0300)]
hyperv: add synic message delivery
Add infrastructure to deliver SynIC messages to the SynIC message page.
Note that KVM may also want to deliver (SynIC timer) messages to the
same message slot.
The problem is that the access to a SynIC message slot is controlled by
the value of its .msg_type field which indicates if the slot is being
owned by the hypervisor (zero) or by the guest (non-zero).
This leaves no room for synchronizing multiple concurrent producers.
The simplest way to deal with this for both KVM and QEMU is to only
deliver messages in the vcpu thread. KVM already does this; this patch
makes it for QEMU, too.
Specifically,
- add a function for posting messages, which only copies the message
into the staging buffer if its free, and schedules a work on the
corresponding vcpu to actually deliver it to the guest slot;
- instead of a sint ack callback, set up the sint route with a message
status callback. This function is called in a bh whenever there are
updates to the message slot status: either the vcpu made definitive
progress delivering the message from the staging buffer (succeeded or
failed) or the guest issued EOM; the status is passed as an argument
to the callback.
Roman Kagan [Fri, 21 Sep 2018 08:22:11 +0000 (11:22 +0300)]
hyperv: make overlay pages for SynIC
Per Hyper-V spec, SynIC message and event flag pages are to be
implemented as so called overlay pages. That is, they are owned by the
hypervisor and, when mapped into the guest physical address space,
overlay the guest physical pages such that
1) the overlaid guest page becomes invisible to the guest CPUs until the
overlay page is turned off
2) the contents of the overlay page is preserved when it's turned off
and back on, even at a different address; it's only zeroed at vcpu
reset
This particular nature of SynIC message and event flag pages is ignored
in the current code, and guest physical pages are used directly instead.
This happens to (mostly) work because the actual guests seem not to
depend on the features listed above.
This patch implements those pages as the spec mandates.
Since the extra RAM regions, which introduce migration incompatibility,
are only added at SynIC object creation which only happens when
hyperv_synic_kvm_only == false, no extra compat logic is necessary.
Roman Kagan [Fri, 21 Sep 2018 08:22:10 +0000 (11:22 +0300)]
hyperv: only add SynIC in compatible configurations
Certain configurations do not allow SynIC to be used in QEMU. In
particular,
- when hyperv_vpindex is off, SINT routes can't be used as they refer to
the destination vCPU by vp_index
- older KVM (which doesn't expose KVM_CAP_HYPERV_SYNIC2) zeroes out
SynIC message and event pages on every msr load, breaking migration
OTOH in-KVM users of SynIC -- SynIC timers -- do work in those
configurations, and we shouldn't stop the guest from using them.
To cover both scenarios, introduce an X86CPU property that makes CPU
init code to skip creation of the SynIC object (and thus disables any
SynIC use in QEMU) but keeps the KVM part of the SynIC working.
The property is clear by default but is set via compat logic for older
machine types.
As a result, when hv_synic and a modern machine type are specified, QEMU
will refuse to run unless vp_index is on and the kernel is recent
enough. OTOH with an older machine type QEMU will run fine with
hv_synic=on against an older kernel and/or without vp_index enabled but
will disallow the in-QEMU uses of SynIC (in e.g. VMBus).
Roman Kagan [Fri, 21 Sep 2018 08:22:09 +0000 (11:22 +0300)]
hyperv: qom-ify SynIC
Make Hyper-V SynIC a device which is attached as a child to a CPU. For
now it only makes SynIC visibile in the qom hierarchy, and maintains its
internal fields in sync with the respecitve msrs of the parent cpu (the
fields will be used in followup patches).
Roman Kagan [Fri, 21 Sep 2018 08:22:08 +0000 (11:22 +0300)]
hyperv:synic: split capability testing and setting
Put a bit more consistency into handling KVM_CAP_HYPERV_SYNIC capability,
by checking its availability and determining the feasibility of hv-synic
property first, and enabling it later.
Roman Kagan [Fri, 21 Sep 2018 08:20:41 +0000 (11:20 +0300)]
i386: add hyperv-stub for CONFIG_HYPERV=n
This will allow to build slightly leaner QEMU that supports some HyperV
features of KVM (e.g. SynIC timers, PV spinlocks, APIC assists, etc.)
but nothing else on the QEMU side.
Roman Kagan [Fri, 21 Sep 2018 08:20:39 +0000 (11:20 +0300)]
hyperv: factor out arch-independent API into hw/hyperv
A significant part of hyperv.c is not actually tied to x86, and can
be moved to hw/.
This will allow to maintain most of Hyper-V and VMBus
target-independent, and to avoid conflicts with inclusion of
arch-specific headers down the road in VMBus implementation.
Also this stuff can now be opt-out with CONFIG_HYPERV.
Roman Kagan [Fri, 21 Sep 2018 08:20:37 +0000 (11:20 +0300)]
hyperv: split hyperv-proto.h into x86 and arch-independent parts
Some parts of the Hyper-V hypervisor-guest interface appear to be
target-independent, so move them into a proper header.
Not that Hyper-V ARM64 emulation is around the corner but it seems more
conveninent to have most of Hyper-V and VMBus target-independent, and
allows to avoid conflicts with inclusion of arch-specific headers down
the road in VMBus implementation.
Roman Kagan [Fri, 21 Sep 2018 08:18:35 +0000 (11:18 +0300)]
hyperv: make HvSintRoute reference-counted
Multiple entities (e.g. VMBus devices) can use the same SINT route. To
make their lives easier in maintaining SINT route ownership, make it
reference-counted. Adjust the respective API names accordingly.
Roman Kagan [Fri, 21 Sep 2018 08:18:29 +0000 (11:18 +0300)]
hyperv_testdev: refactor for better maintainability
Make hyperv_testdev slightly easier to follow and enhance in future.
For that, put the hyperv sint routes (wrapped in a helper structure) on
a linked list rather than a fixed-size array. Besides, this way
HvSintRoute can be treated as an opaque structure, allowing for easier
refactoring of the core Hyper-V SynIC code in followup pathches.
Paolo Bonzini [Sat, 13 Oct 2018 09:52:34 +0000 (11:52 +0200)]
scsi-disk: fix rerror/werror=ignore
rerror=ignore was returning true from scsi_handle_rw_error but the callers were not
calling scsi_req_complete when rerror=ignore returns true (this is the correct thing
to do when true is returned after executing a passthrough command). Fix this by
calling it in scsi_handle_rw_error.
Paolo Bonzini [Sat, 13 Oct 2018 09:49:16 +0000 (11:49 +0200)]
scsi-disk: fix double completion of failing passthrough requests
If a command fails with a sense that scsi_sense_buf_to_errno converts to
ECANCELED/EAGAIN/ENOTCONN or with a unit attention, scsi_req_complete is
called twice. This caused a crash.
Igor Mammedov [Tue, 16 Oct 2018 13:33:40 +0000 (15:33 +0200)]
call HotplugHandler->plug() as the last step in device realization
When [2] was fixed it was agreed that adding and calling post_plug()
callback after device_reset() was low risk approach to hotfix issue
right before release. So it was merged instead of moving already
existing plug() callback after device_reset() is called which would
be more risky and require all plug() callbacks audit.
Looking at the current plug() callbacks, it doesn't seem that moving
plug() callback after device_reset() is breaking anything, so here
goes agreed upon [3] proper fix which essentially reverts [1][2]
and moves plug() callback after device_reset().
This way devices always comes to plug() stage, after it's been fully
initialized (including being reset), which fixes race condition [2]
without need for an extra post_plug() callback.
Artem Pisarenko [Thu, 18 Oct 2018 07:12:55 +0000 (13:12 +0600)]
vl, qapi: offset calculation in RTC_CHANGE event reverted
Return value of qemu_timedate_diff(), used for calculation offset in
QAPI 'RTC_CHANGE' event, restored to keep compatibility. Since it
wasn't documented that difference is relative to host clock
advancement, this change also adds important note to 'RTC_CHANGE'
event description to highlight established implementation specifics.
Artem Pisarenko [Thu, 18 Oct 2018 07:12:54 +0000 (13:12 +0600)]
Fixes RTC bug with base datetime shifts in clock=vm
This makes all current "-rtc" option parameters combinations produce
fixed/unambiguous RTC timedate reference for hardware emulation
frontends.
It restores determinism of guest execution when used with clock=vm and
specified base <datetime> value.
Roman Bolshakov [Thu, 18 Oct 2018 13:44:01 +0000 (16:44 +0300)]
i386: hvf: Fix register refs if REX is present
According to Intel(R)64 and IA-32 Architectures Software Developer's
Manual, the following one-byte registers should be fetched when REX
prefix is present (sorted by reg encoding index):
AL, CL, DL, BL, SPL, BPL, SIL, DIL, R8L - R15L
The first 8 are fetched if REX.R is zero, the last 8 if non-zero.
The following registers should be fetched for instructions without REX
prefix (also sorted by reg encoding index):
AL, CL, DL, BL, AH, CH, DH, BH
Current emulation code doesn't handle accesses to SPL, BPL, SIL, DIL
when REX is present, thefore an instruction 40883e "mov %dil,(%rsi)" is
decoded as "mov %bh,(%rsi)".
That caused an infinite loop in vp_reset:
https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg03293.html
Hyper-V PV IPI support is merged to KVM, enable the feature in Qemu. When
enabled, this allows Windows guests to send IPIs to other vCPUs with a
single hypercall even when there are >64 vCPUs in the request.
Pavel Dovgalyuk [Thu, 18 Oct 2018 06:33:45 +0000 (09:33 +0300)]
replay: don't process events at virtual clock checkpoint
As QEMU becomes more multi-threaded and non-synchronized, checkpoints
move from thread to thread. And the event queue that processed at checkpoints
should belong to the same thread in both record and replay executions.
This patch disables asynchronous event processing at virtual clock
checkpoint, because it may be invoked in different threads at record and
replay. This patch is temporary fix until the checkpoints are completely
refactored.
Paolo Bonzini [Thu, 18 Oct 2018 12:35:23 +0000 (14:35 +0200)]
target-i386: kvm: do not initialize padding fields
The exception.pad field is going to be renamed to pending in an upcoming
header file update. Remove the unnecessary initialization; it was
introduced to please valgrind (commit 7e680753cfa2) but they were later
rendered unnecessary by commit 076796f8fd27f4d, which added the "= {}"
initializer to the declaration of "events". Therefore the patch does
not change behavior in any way.
Artem Pisarenko [Wed, 17 Oct 2018 08:24:20 +0000 (14:24 +0600)]
qemu-timer: avoid checkpoints for virtual clock timers in external subsystems
Adds EXTERNAL attribute definition to qemu timers subsystem and assigns
it to virtual clock timers, used in slirp (ICMP IPv6) and ui (key queue).
Virtual clock processing in rr mode can use this attribute instead of a
separate clock type.
Artem Pisarenko [Wed, 17 Oct 2018 08:24:19 +0000 (14:24 +0600)]
qemu-timer: introduce timer attributes
Attributes are simple flags, associated with individual timers for their
whole lifetime. They intended to be used to mark individual timers for
special handling when they fire.
New/init functions family in timer interface updated and refactored (new
'attribute' argument added, timer_list replaced with timer_list_group+type
combinations, comments improved to avoid info duplication). Also existing
aio interface extended with attribute-enabled variants of functions,
which create/initialize timers.
Artem Pisarenko [Wed, 17 Oct 2018 08:24:18 +0000 (14:24 +0600)]
Revert some patches from recent [PATCH v6] "Fixing record/replay and adding reverse debugging"
That patch series introduced new virtual clock type for use in external
subsystems. It breaks desired behavior in non-record/replay usage
scenarios due to a small change to existing behavior. Processing of
virtual timers belonging to new clock type is kicked off to the main
loop, which makes these timers asynchronous with vCPU thread and,
in icount mode, with whole guest execution. This breaks expected
determinism in non-record/replay icount mode of emulation where these
"external subsystems" are isolated from the host (i.e. they are
external only to guest core, not to the entire emulation environment).
Example for slirp ("user" backend for network device):
User runs qemu in icount mode with rtc clock=vm without any external
communication interfaces but with "-netdev user,restrict=on". It expects
deterministic execution, because network services are emulated inside
qemu and isolated from host. There are no reasons to get reply from DHCP
server with different delay or something like that.
Paolo Bonzini [Fri, 24 Aug 2018 15:03:41 +0000 (17:03 +0200)]
es1370: more fixes for ADC_FRAMEADR and ADC_FRAMECNT
They are not consecutive with DAC1_FRAME* and DAC2_FRAME*; Coverity
still complains about es1370_read, while es1370_write was fixed in
commit cf9270e5220671f49cc238deaf6136669cc07ae1.
Fixes: 154c1d1f960c5147a3f8ef00907504112f271cd8 Signed-off-by: Paolo Bonzini <[email protected]>
Peter Maydell [Fri, 19 Oct 2018 09:08:31 +0000 (10:08 +0100)]
Merge remote-tracking branch 'remotes/amarkovic/tags/mips-queue-october-2018-part1-v2' into staging
MIPS queue October 2018, part1, v2
# gpg: Signature made Thu 18 Oct 2018 19:39:00 BST
# gpg: using RSA key D4972A8967F75A65
# gpg: Good signature from "Aleksandar Markovic <[email protected]>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8526 FBF1 5DA3 811F 4A01 DD75 D497 2A89 67F7 5A65
* remotes/amarkovic/tags/mips-queue-october-2018-part1-v2: (28 commits)
target/mips: Add opcodes for nanoMIPS EVA instructions
target/mips: Fix misplaced 'break' in handling of NM_SHRA_R_PH
target/mips: Fix emulation of microMIPS R6 <SELEQZ|SELNEZ>.<D|S>
target/mips: Implement hardware page table walker for MIPS32
target/mips: Add reset state for PWSize and PWField registers
target/mips: Add CP0 PWCtl register
target/mips: Add CP0 PWSize register
target/mips: Add CP0 PWField register
target/mips: Add CP0 PWBase register
target/mips: Add CP0 Config2 to DisasContext
target/mips: Improve DSP R2/R3-related naming
target/mips: Add availability control for DSP R3 ASE
target/mips: Add bit definitions for DSP R3 ASE
target/mips: Reorganize bit definitions for insn_flags (ISAs/ASEs flags)
target/mips: Increase 'supported ISAs/ASEs' flag holder size
target/mips: Add opcode values of MXU ASE
target/mips: Add organizational chart of MXU ASE
target/mips: Add assembler mnemonics list for MXU ASE
target/mips: Add basic description of MXU ASE
target/mips: Add a comment before each CP0 register section in cpu.h
...
Jason Wang [Tue, 16 Oct 2018 09:40:45 +0000 (17:40 +0800)]
e1000: indicate dropped packets in HW counters
The e1000 emulation silently discards RX packets if there's
insufficient space in the ring buffer. This leads to errors
on higher-level protocols in the guest, with no indication
about the error cause.
This patch increments the "Missed Packets Count" (MPC) and
"Receive No Buffers Count" (RNBC) HW counters in this case.
As the emulation has no FIFO for buffering packets that can't
immediately be pushed to the guest, these two registers are
practically equivalent (see 10.2.7.4, 10.2.7.33 in
https://www.intel.com/content/www/us/en/embedded/products/networking/82574l-gbe-controller-datasheet.html).
On a Linux guest, the register content will be reflected in
the "rx_missed_errors" and "rx_no_buffer_count" stats from
"ethtool -S", and in the "missed" stat from "ip -s -s link show",
giving at least some hint about the error cause inside the guest.
If the cause is known, problems like this can often be avoided
easily, by increasing the number of RX descriptors in the guest
e1000 driver (e.g under Linux, "e1000.RxDescriptors=1024").
The patch also adds a qemu trace message for this condition.
Jason Wang [Wed, 30 May 2018 05:16:36 +0000 (13:16 +0800)]
net: ignore packet size greater than INT_MAX
There should not be a reason for passing a packet size greater than
INT_MAX. It's usually a hint of bug somewhere, so ignore packet size
greater than INT_MAX in qemu_deliver_packet_iov()
Jason Wang [Wed, 30 May 2018 04:11:30 +0000 (12:11 +0800)]
pcnet: fix possible buffer overflow
In pcnet_receive(), we try to assign size_ to size which converts from
size_t to integer. This will cause troubles when size_ is greater
INT_MAX, this will lead a negative value in size and it can then pass
the check of size < MIN_BUF_SIZE which may lead out of bound access
for both buf and buf1.
Jason Wang [Wed, 30 May 2018 05:07:43 +0000 (13:07 +0800)]
rtl8139: fix possible out of bound access
In rtl8139_do_receive(), we try to assign size_ to size which converts
from size_t to integer. This will cause troubles when size_ is greater
INT_MAX, this will lead a negative value in size and it can then pass
the check of size < MIN_BUF_SIZE which may lead out of bound access of
for both buf and buf1.
Jason Wang [Wed, 30 May 2018 05:08:15 +0000 (13:08 +0800)]
ne2000: fix possible out of bound access in ne2000_receive
In ne2000_receive(), we try to assign size_ to size which converts
from size_t to integer. This will cause troubles when size_ is greater
INT_MAX, this will lead a negative value in size and it can then pass
the check of size < MIN_BUF_SIZE which may lead out of bound access of
for both buf and buf1.
Before, we did not clear callback like handle_output when delete
the virtqueue which may result be segmentfault.
The scene is as follows:
1. Start a vm with multiqueue vhost-net,
2. then we write VIRTIO_PCI_GUEST_FEATURES in PCI configuration to
triger multiqueue disable in this vm which will delete the virtqueue.
In this step, the tx_bh is deleted but the callback virtio_net_handle_tx_bh
still exist.
3. Finally, we write VIRTIO_PCI_QUEUE_NOTIFY in PCI configuration to
notify the deleted virtqueue. In this way, virtio_net_handle_tx_bh
will be called and qemu will be crashed.
Although the way described above is uncommon, we had better reinforce it.
filter-rewriter: handle checkpoint and failover event
After one round of checkpoint, the states between PVM and SVM
become consistent, so it is unnecessary to adjust the sequence
of net packets for old connections, besides, while failover
happens, filter-rewriter will into failover mode that needn't
handle the new TCP connection.
savevm: split the process of different stages for loadvm/savevm
There are several stages during loadvm/savevm process. In different stage,
migration incoming processes different types of sections.
We want to control these stages more accuracy, it will benefit COLO
performance, we don't have to save type of QEMU_VM_SECTION_START
sections everytime while do checkpoint, besides, we want to separate
the process of saving/loading memory and devices state.
So we add three new helper functions: qemu_load_device_state() and
qemu_savevm_live_state() to achieve different process during migration.
Besides, we make qemu_loadvm_state_main() and qemu_save_device_state()
public, and simplify the codes of qemu_save_device_state() by calling the
wrapper qemu_savevm_state_header().
qmp event: Add COLO_EXIT event to notify users while exited COLO
If some errors happen during VM's COLO FT stage, it's important to
notify the users of this event. Together with 'x-colo-lost-heartbeat',
Users can intervene in COLO's failover work immediately.
If users don't want to get involved in COLO's failover verdict,
it is still necessary to notify users that we exited COLO mode.
During the time of VM's running, PVM may dirty some pages, we will transfer
PVM's dirty pages to SVM and store them into SVM's RAM cache at next checkpoint
time. So, the content of SVM's RAM cache will always be same with PVM's memory
after checkpoint.
Instead of flushing all content of PVM's RAM cache into SVM's MEMORY,
we do this in a more efficient way:
Only flush any page that dirtied by PVM since last checkpoint.
In this way, we can ensure SVM's memory same with PVM's.
Besides, we must ensure flush RAM cache before load device state.
ram/COLO: Record the dirty pages that SVM received
We record the address of the dirty pages that received,
it will help flushing pages that cached into SVM.
Here, it is a trick, we record dirty pages by re-using migration
dirty bitmap. In the later patch, we will start the dirty log
for SVM, just like migration, in this way, we can record both
the dirty pages caused by PVM and SVM, we only flush those dirty
pages from RAM cache while do checkpoint.
COLO: Load dirty pages into SVM's RAM cache firstly
We should not load PVM's state directly into SVM, because there maybe some
errors happen when SVM is receving data, which will break SVM.
We need to ensure receving all data before load the state into SVM. We use
an extra memory to cache these data (PVM's ram). The ram cache in secondary side
is initially the same as SVM/PVM's memory. And in the process of checkpoint,
we cache the dirty pages of PVM into this ram cache firstly, so this ram cache
always the same as PVM's memory at every checkpoint, then we flush this cached ram
to SVM after we receive all PVM's state.
We need to know if migration is going into COLO state for
incoming side before start normal migration.
Instead by using the VMStateDescription to send colo_state
from source side to destination side, we use MIG_CMD_ENABLE_COLO
to indicate whether COLO is enabled or not.
While do checkpoint, we need to flush all the unhandled packets,
By using the filter notifier mechanism, we can easily to notify
every compare object to do this process, which runs inside
of compare threads as a coroutine.
filter-rewriter: Add TCP state machine and fix memory leak in connection_track_table
We add almost full TCP state machine in filter-rewriter, except
TCPS_LISTEN and some simplify in VM active close FIN states.
The reason for this simplify job is because guest kernel will track
the TCP status and wait 2MSL time too, if client resend the FIN packet,
guest will resend the last ACK, so we needn't wait 2MSL time in filter-rewriter.
After a net connection is closed, we didn't clear its related resources
in connection_track_table, which will lead to memory leak.
Let's track the state of net connection, if it is closed, its related
resources will be cleared up.
Emilio G. Cota [Tue, 16 Oct 2018 15:38:40 +0000 (11:38 -0400)]
cputlb: read CPUTLBEntry.addr_write atomically
Updates can come from other threads, so readers that do not
take tlb_lock must use atomic_read to avoid undefined
behaviour (UB).
This completes the conversion to tlb_lock. This conversion results
on average in no performance loss, as the following experiments
(run on an Intel i7-6700K CPU @ 4.00GHz) show.
Notes:
- tlb-lock-v2 corresponds to an implementation with a mutex.
- tlb-lock-v3 corresponds to the current implementation, i.e.
a spinlock and a single lock acquisition in tlb_set_page_with_attrs.
GCC7+ will no longer advertise support for 16-byte __atomic operations
if only cmpxchg is supported, as for x86_64. Fortunately, x86_64 still
has support for __sync_compare_and_swap_16 and we can make use of that.
AArch64 does not have, nor ever has had such support, so open-code it.
Emilio G. Cota [Tue, 9 Oct 2018 17:45:56 +0000 (13:45 -0400)]
cputlb: serialize tlb updates with env->tlb_lock
Currently we rely on atomic operations for cross-CPU invalidations.
There are two cases that these atomics miss: cross-CPU invalidations
can race with either (1) vCPU threads flushing their TLB, which
happens via memset, or (2) vCPUs calling tlb_reset_dirty on their TLB,
which updates .addr_write with a regular store. This results in
undefined behaviour, since we're mixing regular and atomic ops
on concurrent accesses.
Fix it by using tlb_lock, a per-vCPU lock. All updaters of tlb_table
and the corresponding victim cache now hold the lock.
The readers that do not hold tlb_lock must use atomic reads when
reading .addr_write, since this field can be updated by other threads;
the conversion to atomic reads is done in the next patch.
Note that an alternative fix would be to expand the use of atomic ops.
However, in the case of TLB flushes this would have a huge performance
impact, since (1) TLB flushes can happen very frequently and (2) we
currently use a full memory barrier to flush each TLB entry, and a TLB
has many entries. Instead, acquiring the lock is barely slower than a
full memory barrier since it is uncontended, and with a single lock
acquisition we can flush the entire TLB.