Peter Maydell [Tue, 25 Nov 2014 18:21:45 +0000 (18:21 +0000)]
qemu-timer: Avoid overflows when converting timeout to struct timespec
In qemu_poll_ns(), when we convert an int64_t nanosecond timeout into
a struct timespec, we may accidentally run into overflow problems if
the timeout is very long. This happens because the tv_sec field is a
time_t, which is signed, so we might end up setting it to a negative
value by mistake. This will result in what was intended to be a
near-infinite timeout turning into an instantaneous timeout, and we'll
busy loop. Cap the maximum timeout at INT32_MAX seconds (about 68 years)
to avoid this problem.
This specifically manifested on ARM hosts as an extreme slowdown on
guest shutdown (when the guest reprogrammed the PL031 RTC to not
generate alarms using a very long timeout) but could happen on other
hosts and guests too.
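A minimal sketch of the capping described above, assuming a non-negative timeout. The helper name ns_to_timespec_capped is hypothetical and only illustrates the idea; it is not claimed to be the code in qemu_poll_ns():

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: convert a non-negative nanosecond timeout into a
 * struct timespec, capping the seconds at INT32_MAX so that a very long
 * timeout cannot wrap to a negative tv_sec on hosts with a narrow time_t. */
static struct timespec ns_to_timespec_capped(int64_t ns)
{
    struct timespec ts;
    int64_t sec = ns / 1000000000LL;

    if (sec > (int64_t)INT32_MAX) {
        sec = INT32_MAX;            /* about 68 years */
    }
    ts.tv_sec = (time_t)sec;
    ts.tv_nsec = (long)(ns % 1000000000LL);
    return ts;
}
```

With this shape, a "near-infinite" timeout such as INT64_MAX nanoseconds still yields a large positive tv_sec rather than a negative one.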
Peter Maydell [Wed, 26 Nov 2014 12:18:00 +0000 (12:18 +0000)]
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
The final 2.2 patches from me.
# gpg: Signature made Wed 26 Nov 2014 11:12:25 GMT using RSA key ID 78C7AE83
# gpg: Good signature from "Paolo Bonzini <[email protected]>"
# gpg: aka "Paolo Bonzini <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
# Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83
* remotes/bonzini/tags/for-upstream:
s390x/kvm: Fix compile error
fw_cfg: fix boot order bug when dynamically modified via QOM
-machine vmport=auto: Fix handling of VMWare ioport emulation for xen
Gonglei [Tue, 25 Nov 2014 04:38:19 +0000 (12:38 +0800)]
fw_cfg: fix boot order bug when dynamically modified via QOM
When we dynamically modify the boot order, its length
changes, but we don't update s->files->f[i].size with
the new length. This causes SeaBIOS to read a wrong
value from the QEMU cfg file for bootorder.
Gerd Hoffmann [Tue, 25 Nov 2014 13:54:17 +0000 (14:54 +0100)]
input: move input-send-event into experimental namespace
There are ongoing discussions on how we are going to specify the
console, so tag the command as experimental so we can refine things in
the 2.3 development cycle.
Peter Maydell [Mon, 24 Nov 2014 19:31:50 +0000 (19:31 +0000)]
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pc, pci, misc bugfixes
A bunch of bugfixes for 2.2.
Signed-off-by: Michael S. Tsirkin <[email protected]>
# gpg: Signature made Mon 24 Nov 2014 18:59:47 GMT using RSA key ID D28D5469
# gpg: Good signature from "Michael S. Tsirkin <[email protected]>"
# gpg: aka "Michael S. Tsirkin <[email protected]>"
* remotes/mst/tags/for_upstream:
pc: acpi: mark all possible CPUs as enabled in SRAT
pcie: fix improper use of negative value
pcie: fix typo in pcie_cap_deverr_init()
target-i386: move generic memory hotplug methods to DSDTs
acpi-build: mark RAM dirty on table update
hw/pci: fix crash on shpc error flow
pc: count in 1Gb hugepage alignment when sizing hotplug-memory container
pc: explicitly check maxmem limit when adding DIMM
pc: pc-dimm: use backend alignment during address auto allocation
pc: align DIMM's address/size by backend's alignment value
memory: expose alignment used for allocating RAM as MemoryRegion API
pc: limit DIMM address and size to page aligned values
pc: make pc_dimm_plug() more readble
pc: kvm: check if KVM has free memory slots to avoid abort()
qemu-char: fix tcp_get_fds
Igor Mammedov [Mon, 10 Nov 2014 16:20:50 +0000 (16:20 +0000)]
pc: acpi: mark all possible CPUs as enabled in SRAT
If QEMU is started with -numa ..., Windows only notices that a
CPU has been hot-added but will not online such CPUs.
This happens because possible CPUs are flagged as
not enabled in SRAT, and Windows, honoring that information,
doesn't use the corresponding CPU.
The ACPI 5.0 spec says of this flag:
"
Table 5-47 Local APIC Flags
...
Enabled: if zero, this processor is unusable, and the operating system
support will not attempt to use it.
"
Fix QEMU to adhere to spec and mark possible CPUs as enabled
in SRAT.
With that Windows onlines hot-added CPUs as expected.
acpi build modifies internal FW CFG RAM on first access
but we forgot to mark it dirty.
If this RAM has been migrated already, it won't be
migrated again, returning corrupted tables to guest.
If the pci bridge enters in error flow as part
of init process it will only delete the shpc mmio
subregion but not remove it from the properties list,
resulting in segmentation fault when the bridge runs
the exit function.
Example: add a pci bridge without specifying the chassis number:
<qemu-bin> ... -device pci-bridge,id=p1
Result:
(qemu) qemu-system-x86_64: -device pci-bridge,id=p1: Bridge chassis not specified. Each bridge is required to be assigned a unique chassis id > 0.
qemu-system-x86_64: -device pci-bridge,id=p1: Device
initialization failed.
Segmentation fault (core dumped)
if (child->class->unparent) {
#0 0x00005555558d629b in object_finalize_child_property (obj=0x555556d2e830, name=0x555556d30630 "shpc-mmio[0]", opaque=0x555556a42fc8) at qom/object.c:1078
#1 0x00005555558d4b1f in object_property_del_all (obj=0x555556d2e830) at qom/object.c:367
#2 0x00005555558d4ca1 in object_finalize (data=0x555556d2e830) at qom/object.c:412
#3 0x00005555558d55a1 in object_unref (obj=0x555556d2e830) at qom/object.c:720
#4 0x000055555572c907 in qdev_device_add (opts=0x5555563544f0) at qdev-monitor.c:566
#5 0x0000555555744f16 in device_init_func (opts=0x5555563544f0, opaque=0x0) at vl.c:2213
#6 0x00005555559cf5f0 in qemu_opts_foreach (list=0x555555e0f8e0 <qemu_device_opts>, func=0x555555744efa <device_init_func>, opaque=0x0, abort_on_failure=1) at util/qemu-option.c:1057
#7 0x000055555574a11b in main (argc=16, argv=0x7fffffffdde8, envp=0x7fffffffde70) at vl.c:423
Unparent the shpc mmio region as part of shpc cleanup.
Igor Mammedov [Fri, 31 Oct 2014 16:38:42 +0000 (16:38 +0000)]
pc: count in 1Gb hugepage alignment when sizing hotplug-memory container
If DIMMs with different sizes/alignments are interleaved
in creation order, it could lead to hotplug-memory
container fragmentation and a subsequent inability to use
all RAM up to maxmem.
For example:
-m 4G,slots=3,maxmem=7G
-object memory-backend-file,id=mem-1,size=256M,mem-path=/pagesize-2MB
-device pc-dimm,id=mem1,memdev=mem-1
-object memory-backend-file,id=mem-2,size=1G,mem-path=/pagesize-1GB
-device pc-dimm,id=mem2,memdev=mem-2
-object memory-backend-file,id=mem-3,size=256M,mem-path=/pagesize-2MB
-device pc-dimm,id=mem3,memdev=mem-3
fragments the hotplug-memory container and doesn't allow
a 1GB hugepage backend to consume the remaining 1GB.
To ease management, factor in the maximum 1GB alignment for
each memory slot when sizing the hotplug-memory region, so
that regardless of fragmentation it is always possible to
add a maximally aligned DIMM.
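The sizing rule can be sketched as follows. This is only an illustration under the assumption that the region is sized as the hot-pluggable range plus 1GB of slack per slot; hotplug_region_size and SKETCH_GIB are hypothetical names:

```c
#include <stdint.h>

#define SKETCH_GIB (1ULL << 30)

/* Hypothetical sizing helper: reserve the hot-pluggable range
 * (maxmem minus initial RAM) plus 1GB of alignment slack per slot, so
 * that a 1GB-aligned DIMM can still be placed after fragmentation. */
static uint64_t hotplug_region_size(uint64_t ram_size, uint64_t maxmem,
                                    unsigned slots)
{
    return (maxmem - ram_size) + (uint64_t)slots * SKETCH_GIB;
}
```

For the example above (-m 4G,slots=3,maxmem=7G) this reserves 6GB of address space for 3GB of pluggable RAM.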
Igor Mammedov [Fri, 31 Oct 2014 16:38:41 +0000 (16:38 +0000)]
pc: explicitly check maxmem limit when adding DIMM
Currently maxmem limit is not checked and depends on
hotplug region container not being able to fit more RAM
than maxmem. Do check explicitly so that it would
be possible to change hotplug container size later
to deal with fragmentation.
Peter Maydell [Mon, 24 Nov 2014 13:50:22 +0000 (13:50 +0000)]
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
Three patches to fix ExtINT for the QEMU implementation of the local APIC.
# gpg: Signature made Mon 24 Nov 2014 13:38:36 GMT using RSA key ID 78C7AE83
# gpg: Good signature from "Paolo Bonzini <[email protected]>"
# gpg: aka "Paolo Bonzini <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
# Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83
* remotes/bonzini/tags/for-upstream:
apic: fix incorrect handling of ExtINT interrupts wrt processor priority
apic: fix loss of IPI due to masked ExtINT
apic: avoid getting out of halted state on masked PIC interrupts
Paolo Bonzini [Tue, 11 Nov 2014 12:14:18 +0000 (13:14 +0100)]
apic: fix incorrect handling of ExtINT interrupts wrt processor priority
This fixes another failure with ExtINT, demonstrated by QNX. The failure
mode is as follows:
- IPI sent to cpu 0 (bit set in APIC irr)
- IPI accepted by cpu 0 (bit cleared in irr, set in isr)
- IPI sent to cpu 0 (bit set in both irr and isr)
- PIC interrupt sent to cpu 0
The PIC interrupt causes CPU_INTERRUPT_HARD to be set, but
apic_irq_pending observes that the highest pending APIC interrupt priority
(the IPI) is the same as the processor priority (since the IPI is still
being handled), so apic_get_interrupt returns a spurious interrupt rather
than the pending PIC interrupt. The result is an endless sequence of
spurious interrupts, since nothing will clear CPU_INTERRUPT_HARD.
Instead, ExtINT interrupts should have ignored the processor priority.
Calling apic_check_pic early in apic_get_interrupt ensures that
apic_deliver_pic_intr is called instead of delivering the spurious
interrupt. apic_deliver_pic_intr then clears CPU_INTERRUPT_HARD if needed.
Paolo Bonzini [Tue, 11 Nov 2014 12:14:14 +0000 (13:14 +0100)]
apic: fix loss of IPI due to masked ExtINT
This patch fixes an obscure failure of the QNX kernel on QEMU x86 SMP.
In QNX, all hardware interrupts come via the PIC, and are delivered by
the cpu 0 LAPIC in ExtINT mode, while IPIs are delivered by the LAPIC
in fixed mode.
This bug happens as follows:
- cpu 0 masks a particular PIC interrupt
- IPI sent to cpu 0 (CPU_INTERRUPT_HARD is set)
- before the IPI is accepted, the masked interrupt line is asserted by the
device
Since the interrupt is masked, apic_deliver_pic_intr will clear
CPU_INTERRUPT_HARD. The IPI will still be set in the APIC irr, but since
CPU_INTERRUPT_HARD is not set the cpu will not notice. Depending on the
scenario this can cause a system hang, i.e. if cpu 0 is expected to unmask
the interrupt.
In order to fix this, do a full check of the APIC before an EXTINT
is acknowledged. This can result in clearing CPU_INTERRUPT_HARD, but
can also result in delivering the lost IPI.
Paolo Bonzini [Tue, 11 Nov 2014 12:14:05 +0000 (13:14 +0100)]
apic: avoid getting out of halted state on masked PIC interrupts
After the next patch, if a masked PIC interrupt causes CPU_INTERRUPT_POLL
to be set, the CPU will spuriously get out of halted state. While this
is technically valid, we should avoid that.
Make CPU_INTERRUPT_POLL run apic_update_irq in the right thread and then
look at CPU_INTERRUPT_HARD. If CPU_INTERRUPT_HARD does not get set,
do not report the CPU as having work.
Also move the handling of software-disabled APIC from apic_update_irq
to apic_irq_pending, and always trigger CPU_INTERRUPT_POLL. This will
be important once we will add a case that resets CPU_INTERRUPT_HARD
from apic_update_irq. We want to run it even if we go through
CPU_INTERRUPT_POLL, and even if the local APIC is software disabled.
The main reason for reverting this commit before the 2.2 release is that
it adds a QAPI interface that we don't want to keep: The 'nocow' flag
doesn't generally make sense for block nodes, but only for the raw-posix
driver. It should therefore be part of ImageInfoSpecific rather than
ImageInfo.
The commit contains more problems, but unlike the API stability issue
they wouldn't justify reverting it.
Igor Mammedov [Fri, 31 Oct 2014 16:38:36 +0000 (16:38 +0000)]
pc: limit DIMM address and size to page aligned values
When running in KVM mode, kvm_set_phys_mem() will silently
fail if the registered MemoryRegion address/size is not page
aligned, causing memory hotplug failure in the guest.
Mapping a non-aligned MemoryRegion in TCG mode 'works', but
a sane guest OS still expects a page-aligned memory module
and fails to initialize it if it's not aligned.
So do not allow non-aligned (i.e. invalid) address/size
values for a DIMM, to avoid either KVM failure or guest
issues caused by it.
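A check of this kind can be sketched as below, assuming a power-of-two page size. The function name is hypothetical and the page size is passed in rather than taken from any QEMU macro:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: both the DIMM address and its size must be
 * multiples of page_size (assumed to be a power of two). */
static bool dimm_range_is_page_aligned(uint64_t addr, uint64_t size,
                                       uint64_t page_size)
{
    uint64_t mask = page_size - 1;

    return (addr & mask) == 0 && (size & mask) == 0;
}
```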
Gonglei [Thu, 20 Nov 2014 11:35:02 +0000 (19:35 +0800)]
pcnet: fix Negative array index read
s->xmit_pos may be assigned a negative value (-1),
but in this branch s->xmit_pos is used as an index into
the array s->buffer. Add a check for s->xmit_pos.
Gonglei [Thu, 20 Nov 2014 11:35:00 +0000 (19:35 +0800)]
net/slirp: fix memory leak
Commit b412eb61 introduced the 'cmd:' target for guestfwd.
fwd is not used in that scenario, so its memory is leaked
in the true branch that handles 'cmd:'. Allocate memory
for the fwd variable only in the else branch.
Leif Lindholm [Wed, 19 Nov 2014 11:08:45 +0000 (11:08 +0000)]
hw/arm/virt: set stdout-path instead of linux,stdout-path
ePAPR 1.1 defines the stdout-path property, making the os-specific
linux,stdout-path property redundant. Change the DT setup for ARM virt
to use the generic property - supported by Linux since 3.15.
The old QEMU behaviour was not present in any released version of
QEMU, and was only added to QEMU after the kernel changed, so
this should not break any existing setups.
Signed-off-by: Leif Lindholm <[email protected]>
[PMM: add note to commit about the old behaviour never having been
in a released version of QEMU] Signed-off-by: Peter Maydell <[email protected]>
The Move to Vector Status and Control Register (mtvscr) instruction
uses VRB as the source register. Fix the code generator to correctly
decode the VRB field. That is, use "rB(ctx->opcode)" instead of
"rD(ctx->opcode)".
Alexander Graf [Fri, 7 Nov 2014 21:12:48 +0000 (22:12 +0100)]
kvm: Fix memory slot page alignment logic
Memory slots have to be page aligned to get entered into KVM. There
is existing logic that tries to ensure that we pad memory slots that
are not page aligned to the biggest region that would still fit in the
alignment requirements.
Unfortunately, that logic is broken. It tries to calculate the start
offset based on the region size.
Fix up the logic to do the thing it was intended to do and document it
properly in the comment above it.
With this patch applied, I can successfully run an e500 guest with more
than 3GB RAM (at which point RAM starts overlapping subpage memory regions).
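The intended padding can be sketched as follows: round the start address up to the next page boundary, then truncate the remaining size down to a whole number of pages. This is a standalone illustration with an assumed 4KB page size and hypothetical names, not the code in kvm-all.c:

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumed page size for the sketch */

/* Shrink [*start, *start + *size) to the largest page-aligned sub-range.
 * The start offset is derived from the start address, not from the size.
 * Returns 1 if at least one whole page remains, 0 otherwise. */
static int align_slot(uint64_t *start, uint64_t *size)
{
    uint64_t mask = SKETCH_PAGE_SIZE - 1;
    /* Bytes needed to reach the next page boundary (0 if already aligned) */
    uint64_t delta = (SKETCH_PAGE_SIZE - (*start & mask)) & mask;

    if (delta > *size) {
        *size = 0;
        return 0;
    }
    *start += delta;
    *size = (*size - delta) & ~mask;
    return *size != 0;
}
```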
Peter Maydell [Thu, 20 Nov 2014 13:00:28 +0000 (13:00 +0000)]
Merge remote-tracking branch 'remotes/amit-migration/tags/for-2.2-2' into staging
Fix from a while back that unfortunately got ignored. Dave Gilbert says
it may actually fix a case where autoconverge would break on a repeat
migration (and not just fix stats).
Tracing: Fix simpletrace.py error on tcg enabled binary traces
simpletrace.py does not recognize the tcg option while reading the trace-events file. As a result, simpletrace does not work on binary traces with tcg-enabled events. Moved the transformation of tcg-enabled events to _read_events(), which is used by simpletrace.
Peter Maydell [Tue, 18 Nov 2014 13:43:37 +0000 (13:43 +0000)]
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block patches for 2.2.0-rc2
# gpg: Signature made Tue 18 Nov 2014 11:32:55 GMT using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <[email protected]>"
* remotes/kevin/tags/for-upstream:
block/raw-posix: Catch fsync() errors
block/raw-posix: Only sync after successful preallocation
block/raw-posix: Fix preallocating write() loop
raw-posix: The SEEK_HOLE code is flawed, rewrite it
raw-posix: SEEK_HOLE suffices, get rid of FIEMAP
raw-posix: Fix comment for raw_co_get_block_status()
During migration, the values read from the migration stream during ram load
are not validated. In particular, neither the offset in
host_from_stream_offset() nor the length of the writes in its callers is checked.
To fix this, we need to make sure that the [offset, offset + length]
range fits into one of the allocated memory regions.
Validating addr < len should be sufficient since data seems to always be
managed in TARGET_PAGE_SIZE chunks.
Fixes: CVE-2014-7840
Note: follow-up patches add extra checks on each block->host access.
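The range check described above can be sketched in an overflow-safe form. The function name and the plain integer parameters are assumptions for illustration; the real code operates on RAMBlock structures:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical check: does [offset, offset + length) fit inside a block
 * of block_len bytes? Written so that offset + length is never computed,
 * which avoids wrap-around on hostile stream values. */
static bool range_in_block(uint64_t offset, uint64_t length,
                           uint64_t block_len)
{
    return offset <= block_len && length <= block_len - offset;
}
```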
Max Reitz [Tue, 18 Nov 2014 10:23:04 +0000 (11:23 +0100)]
block/raw-posix: Fix preallocating write() loop
write() may write fewer bytes than requested; in this case, the number of
bytes written is returned. This is the byte count we should be
subtracting from the number of bytes still to be written, and not the
byte count we requested to write.
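A minimal sketch of a write loop that accounts for short writes. write_all is a hypothetical helper, not the preallocation code in raw-posix:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes from buf to fd.
 * Returns 0 on success or -errno on failure. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR) {
                continue;           /* interrupted, retry */
            }
            return -errno;
        }
        /* Advance by what was actually written, not by what was requested */
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```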
Peter Maydell [Sun, 16 Nov 2014 19:44:21 +0000 (19:44 +0000)]
exec: Handle multipage ranges in invalidate_and_set_dirty()
The code in invalidate_and_set_dirty() needs to handle addr/length
combinations which cross guest physical page boundaries. This can happen,
for example, when disk I/O reads large blocks into guest RAM which previously
held code that we have cached translations for. Unfortunately we were only
checking the clean/dirty status of the first page in the range, and then
were calling a tb_invalidate function which only handles ranges that don't
cross page boundaries. Fix the function to deal with multipage ranges.
The symptoms of this bug were that guest code would misbehave (eg segfault),
in particular after a guest reboot but potentially any time the guest
reused a page of its physical RAM for new code.
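One way to handle such ranges is to split them at page boundaries and treat each piece separately. The sketch below assumes a 4KB page size and a hypothetical per-page callback; it does not reproduce QEMU's dirty-tracking or TB invalidation calls:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumed guest page size */

/* Call visit() once per page touched by [addr, addr + len), passing the
 * part of the range that lies within that page. Returns the page count. */
static unsigned visit_pages(uint64_t addr, uint64_t len,
                            void (*visit)(uint64_t start, uint64_t chunk,
                                          void *opaque),
                            void *opaque)
{
    unsigned pages = 0;

    while (len > 0) {
        /* Bytes left before the next page boundary */
        uint64_t in_page = SKETCH_PAGE_SIZE - (addr & (SKETCH_PAGE_SIZE - 1));
        uint64_t chunk = len < in_page ? len : in_page;

        if (visit) {
            visit(addr, chunk, opaque);
        }
        addr += chunk;
        len -= chunk;
        pages++;
    }
    return pages;
}
```

A 32-byte access starting at 0xff0 is visited as two chunks, one per page, so neither page's clean/dirty state is skipped.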
Kevin Wolf [Tue, 18 Nov 2014 10:01:05 +0000 (11:01 +0100)]
Merge remote-tracking branch 'mreitz/block' into queue-block
* mreitz/block:
raw-posix: The SEEK_HOLE code is flawed, rewrite it
raw-posix: SEEK_HOLE suffices, get rid of FIEMAP
raw-posix: Fix comment for raw_co_get_block_status()
raw-posix: The SEEK_HOLE code is flawed, rewrite it
On systems where SEEK_HOLE in a trailing hole seeks to EOF (Solaris,
but not Linux), try_seek_hole() reports trailing data instead.
Additionally, unlikely lseek() failures are treated badly:
* When SEEK_HOLE fails, try_seek_hole() reports trailing data. For
-ENXIO, there's in fact a trailing hole. Can happen only when
something truncated the file since we opened it.
* When SEEK_HOLE succeeds, SEEK_DATA fails, and SEEK_END succeeds,
then try_seek_hole() reports a trailing hole. This is okay only
when SEEK_DATA failed with -ENXIO (which means the non-trailing hole
found by SEEK_HOLE has since become trailing somehow). For other
failures (unlikely), it's wrong.
* When SEEK_HOLE succeeds, SEEK_DATA fails, SEEK_END fails (unlikely),
then try_seek_hole() reports bogus data [-1,start), which its caller
raw_co_get_block_status() turns into zero sectors of data. Could
theoretically lead to infinite loops in code that attempts to scan
data vs. hole forward.
Commit 5500316 (May 2012) implemented raw_co_is_allocated() as
follows:
1. If defined(CONFIG_FIEMAP), use the FS_IOC_FIEMAP ioctl
2. Else if defined(SEEK_HOLE) && defined(SEEK_DATA), use lseek()
3. Else pretend there are no holes
Later on, raw_co_is_allocated() was generalized to
raw_co_get_block_status().
Commit 4f11aa8 (May 2014) changed it to try the three methods in order
until success, because "there may be implementations which support
[SEEK_HOLE/SEEK_DATA] but not [FIEMAP] (e.g., NFSv4.2) as well as vice
versa."
Unfortunately, we used FIEMAP incorrectly: we lacked FIEMAP_FLAG_SYNC.
Commit 38c4d0a (Sep 2014) added it. Because that's a significant
speed hit, the next commit 7c159037 put SEEK_HOLE/SEEK_DATA first.
As you see, the obvious use of FIEMAP is wrong, and the correct use is
slow. I guess this puts it somewhere between -7 "The obvious use is
wrong" and -10 "It's impossible to get right" on Rusty Russell's Hard
to Misuse scale[*].
"Fortunately", the FIEMAP code is used only when
* SEEK_HOLE/SEEK_DATA aren't defined, but CONFIG_FIEMAP is
Uncommon. SEEK_HOLE had no XFS implementation between 2011 (when it
was introduced for ext4 and btrfs) and 2012.
* SEEK_HOLE/SEEK_DATA and CONFIG_FIEMAP are defined, but lseek() fails
Unlikely.
Thus, the FIEMAP code executes rarely. Makes it a nice hidey-hole for
bugs. Worse, bugs hiding there can theoretically bite even on a host
that has SEEK_HOLE/SEEK_DATA.
I don't want to worry about this crap, not even theoretically. Get
rid of it.
Peter Maydell [Thu, 13 Nov 2014 14:56:09 +0000 (14:56 +0000)]
target-arm: handle address translations that start at level 3
The ARMv8 address translation system defines that a page table walk
starts at a level which depends on the translation granule size
and the number of bits of virtual address that need to be resolved.
Where the translation granule is 64KB and the guest sets the
TCR.TxSZ field to between 35 and 39, it's actually possible to
start at level 3 (the final level). QEMU's implementation failed
to handle this case, and so we would set level to 2 and behave
incorrectly (including invoking the C undefined behaviour of
shifting left by a negative number). Correct the code that
determines the starting level to deal with the start-at-3 case,
by replacing the if-else ladder with an expression derived from
the ARM ARM pseudocode version.
This error was detected by the Coverity scan, which spotted
the potential shift by a negative number.
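The starting level can be expressed as four minus the number of levels needed, where that number is the address bits above the granule divided by the bits resolved per level, rounded up. The sketch below assumes a 64-bit virtual address and uses hypothetical names; it is one way to write such an expression and is not claimed to be the exact one in the patch:

```c
#include <stdint.h>

/* Hypothetical helper.
 * granule_bits: log2 of the translation granule (12 for 4KB, 16 for 64KB)
 * tsz:          the TCR.TxSZ value */
static int start_level(unsigned granule_bits, unsigned tsz)
{
    unsigned stride = granule_bits - 3;   /* VA bits resolved per level */
    unsigned inputsize = 64 - tsz;        /* VA bits to be resolved */
    /* levels = ceil((inputsize - granule_bits) / stride) */
    unsigned levels = (inputsize - granule_bits + stride - 1) / stride;

    return 4 - (int)levels;
}
```

With a 64KB granule, every TxSZ value from 35 to 39 needs only one level, so the walk starts at level 3.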
Peter Maydell [Mon, 17 Nov 2014 17:22:03 +0000 (17:22 +0000)]
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
A smattering of fixes for problems that Coverity reported.
# gpg: Signature made Mon 17 Nov 2014 17:03:25 GMT using RSA key ID 78C7AE83
# gpg: Good signature from "Paolo Bonzini <[email protected]>"
# gpg: aka "Paolo Bonzini <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
# Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83
zhanghailiang [Mon, 17 Nov 2014 05:57:34 +0000 (13:57 +0800)]
target-cris/translate.c: fix out of bounds read
In the functions t_gen_mov_TN_preg and t_gen_mov_preg_TN, the initial check
on the validity of the parameter 'r' is useless: if 'r' is invalid, we still
access cpu_PR[r] in the following code, which is an out-of-bounds read.
Fix it by using assert() to ensure 'r' is valid before using it.
Gonglei [Sat, 15 Nov 2014 10:06:44 +0000 (18:06 +0800)]
nvme: remove superfluous check
Operands don't affect result (CONSTANT_EXPRESSION_RESULT)
((n->bar.aqa >> AQA_ASQS_SHIFT) & AQA_ASQS_MASK) > 4095
is always false regardless of the values of its operands.
This occurs as the logical second operand of '||'.
Gonglei [Sat, 15 Nov 2014 10:06:42 +0000 (18:06 +0800)]
qga: fix false negative argument passing
send_response(s, &qdict->base) returns a negative number
when any failure occurs, but strerror()'s parameter must not be
negative. Change the test condition and pass '-ret' to
strerror().
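The convention can be sketched as follows. send_response_stub is a hypothetical stand-in for a function that returns 0 on success or a negative errno value, and describe_result is likewise only illustrative:

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in: returns 0 on success or a negative errno */
static int send_response_stub(int fail)
{
    return fail ? -EIO : 0;
}

/* Test for ret < 0 and negate before calling strerror(), which expects
 * a positive errno value. */
static const char *describe_result(int ret)
{
    return ret < 0 ? strerror(-ret) : "success";
}
```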
zhanghailiang [Fri, 14 Nov 2014 02:18:08 +0000 (10:18 +0800)]
libcacard: fix resource leak
In connect_to_qemu(), getaddrinfo() allocates memory
that is stored in server; it should be freed with freeaddrinfo()
before connect_to_qemu() returns.
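A minimal sketch of the pattern, using a hypothetical connect_to_host() helper rather than libcacard's connect_to_qemu(). The point is that the list returned by getaddrinfo() is released on every path once the lookup has succeeded:

```c
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: connect to host:port over TCP.
 * Returns a connected socket, or -1 on failure. */
static int connect_to_host(const char *host, const char *port)
{
    struct addrinfo hints, *server = NULL, *ai;
    int sock = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo(host, port, &hints, &server) != 0) {
        return -1;                  /* no list to free on lookup failure */
    }
    for (ai = server; ai != NULL; ai = ai->ai_next) {
        sock = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (sock < 0) {
            continue;
        }
        if (connect(sock, ai->ai_addr, ai->ai_addrlen) == 0) {
            break;                  /* connected */
        }
        close(sock);
        sock = -1;
    }
    freeaddrinfo(server);           /* free the list whether or not we connected */
    return sock;
}
```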
Roger Pau Monne [Thu, 13 Nov 2014 17:42:09 +0000 (18:42 +0100)]
xen_disk: fix unmapping of persistent grants
This patch fixes two issues with persistent grants and the disk PV backend
(Qdisk):
- Keep track of memory regions where persistent grants have been mapped
since we need to unmap them as a whole. It is not possible to unmap a
single grant if it has been batch-mapped. A new check has also been added
to make sure persistent grants are only used if the whole mapped region
can be persistently mapped in the batch_maps case.
- Unmap persistent grants before switching to the closed state, so the
frontend can also free them.
Igor Mammedov [Fri, 14 Nov 2014 11:11:44 +0000 (11:11 +0000)]
pc: piix4_pm: init legacy PCI hotplug when running on Xen
If the user starts QEMU with "-machine pc,accel=xen", then the
compat property in xenfv won't work, causing the error
"Unsupported bus. Bus doesn't have property 'acpi-pcihp-bsel' set"
when a PCI device is added with -device on the QEMU CLI.
In case of Xen instead of using compat property, just use the fact
that xen doesn't use QEMU's fw_cfg/acpi tables to switch piix4_pm
into legacy PCI hotplug mode when Xen is enabled.
John Snow [Mon, 3 Nov 2014 23:56:19 +0000 (18:56 -0500)]
ahci: factor out FIS decomposition from handle_cmd
In order to make handle_cmd more readable at the macro level,
the details of how to decompose particular types of FIS packets
are left to helper functions.
In our case, the only type of FIS packet we currently expect to
see is a Register H2D FIS packet, but the gory details of its
decomposition are of no particular interest in handle_cmd.
This patch keeps the receipt of FIS packets and the decomposition
thereof separated to two different functions.
John Snow [Mon, 3 Nov 2014 23:56:18 +0000 (18:56 -0500)]
ahci: Check cmd_fis[1] more explicitly
Instead of checking for a known byte, inspect the
fields of this byte explicitly to produce more meaningful
error messages and improve the readability of this section.
John Snow [Mon, 3 Nov 2014 23:56:17 +0000 (18:56 -0500)]
ahci: Reorder error cases in handle_cmd
Error checking in ahci's handle_cmd is re-ordered so that we
initialize as few things as possible before we've done our
sanity checking. This simplifies returning from this call
in case of an error.
A check to make sure the DMA memory map succeeds with the
correct size is also added, and the debug print of the
command fis is cleaned up with its size corrected.
John Snow [Mon, 3 Nov 2014 23:56:16 +0000 (18:56 -0500)]
ahci: Fix FIS decomposition
This patch introduces a few changes to how FIS packets are
deciphered in the AHCI virtual device. The summary of
changes can be grouped into two pieces:
[A] Changes to how we apply a preliminary sieve to FISes,
[B] Changes in how we internalize a decomposed FIS.
== Changes to how we apply a preliminary sieve to FISes ==
(1) Packets may now either update the Control register or
the Command register, but not both. This is according
to the SATA 3.2 specification which states:
"...the device either initiates processing of the command
indicated in the Command register or initiates processing
of the control request indicated [...] depending on the
state of the C bit in the FIS."
See SATA 3.2 section 10.5.5.4, "Reception" in the 10.5.5
"Register Host to Device FIS" section.
This change accounts for the first two regions of change
within the diff. All other changes belong to the following
changes.
== Changes in how we internalize a decomposed FIS ==
(2) Instead of trying to extract the sector number out of the
FIS from bytes 4-10 and setting it with ide_set_sector,
we set the appropriate IDEState registers and trust that
ide_get_sector can retrieve the correct sector later.
By "constructing" the sector for use with ide_set_sector,
we are duplicating the mechanisms of ide_get_sector.
This change makes the FIS decomposition more obvious.
SATA 3.2 as a specification does not make the legacy
register mapping with respect to the D2H FIS obvious.
However, SATA 3.2 section 10.5.5.1 "Register Host to
Device FIS layout" describes all of the "cmd_fis"
bytes:
0 - FIS Type (0x27)
1 - Port Multiplier Port and Command Update flag
2 - ATA Command
3 - Features_Low
4 - LBA 7:0
5 - LBA 15:8
6 - LBA 23:16
7 - Device, AKA "Drive Select."
8 - LBA 31:24
9 - LBA 39:32
10 - LBA 47:40
11 - Features_High
12 - Count Low
13 - Count High
14 - ICC
15 - Control
16-19 - Auxiliary (for NCQ, defined per-command)
Most of these registers map to existing IDEState registers
in obvious ways, especially features, select, hob_features,
and nsector (count). ICC is reserved in older specifications
but is not supported in our implementation, and remains
unused here. The Control register is not valid for a command
that is trying to update the command register and is to be
considered reserved at this point.
What is not obvious is the LBA register mappings, but SATA 1.0
can help inform us of legacy device support, see SATA 1.0 section
8.5.2 "Register - Host to Device."
LBA 7:0 - Sector Number (sector)
LBA 15:8 - Cyl Low (lcyl)
LBA 23:16 - Cyl High (hcyl)
LBA 31:24 - Sector Num Exp. (hob_sector)
LBA 39:32 - Cyl Low Exp. (hob_lcyl)
LBA 47:40 - Cyl High Exp. (hob_hcyl)
These mappings help guide which registers the FIS should be decomposed
into/towards for CHS, LBA28 and LBA48 commands.
As a note: The prior confusion that can be seen in the documentation
arises from the fact that CHS and LBA28 commands use the low nybble
of the drive select register to store LBA 27:24, whereas LBA48 commands
use the hob_sector, hob_lcyl and hob_hcyl registers as explained above.
The decomposition as it stands now will correctly decompose CHS, LBA28
and LBA48 commands into their appropriate registers where the core
IDE/ATAPI layers can deal with them correctly.
See the below point for more information.
(3) We save cmd_fis[7] as ide_state->select, which informs
decisions about if we are using LBA or CHS.
This corrects a bug in AHCI wherein we attempt to set and/or
retrieve the sector number by using ide_set_sector and
ide_get_sector, which depend on the select register to
determine if we are using LBA or CHS.
Without this adjustment, LBA48 read/writes are currently
broken. Thanks to Eniac Zheng @ HP for pointing this out.
(4) Save cmd_fis[11] as ide_state->hob_feature, as defined in SATA 3.2.
(5) For several ATA commands, the sector count register set to 0
is a magic number that means 256 sectors. For LBA48 commands,
this means 65,536 sectors. We drop the magic sector correction
here, and trust the ide core layer to handle the conversion
appropriately, in ide_cmd_lba48_transform(). As it stands,
the current AHCI code is only compliant with LBA28 commands.
By simply removing the magic, it will work with LBA28 and LBA48.
(6) We expand FIS decomposition to include both ATAPI and IDE devices.
We leave the logic of determining if the fields are valid or not
to the respective layers.
This change intends to make it clearer that AHCI is only a
composition mechanism for the FIS packets: the meanings of
the registers are best left to the implementation layers for
those devices.
(7) Forcefully setting the feature, hcyl and lcyl registers for ATAPI
commands is removed.
- The hcyl and lcyl magic present here is valid at boot only,
and should not be overridden for every PACKET command.
- The feature register is defined as valid for the PACKET command,
so we should not suppress it. The ATAPI layer does not even
currently depend on or require 0x01 as mandatory.
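The byte layout and register mapping listed under (2), together with the zero-count convention from (5), can be sketched as below. The struct and function names are hypothetical: the fields are named after the registers mentioned in the text but do not reproduce QEMU's IDEState, and the count is kept as one 16-bit value for brevity:

```c
#include <stdint.h>

/* Hypothetical register file for the sketch */
struct sketch_regs {
    uint8_t feature, hob_feature;
    uint8_t sector, lcyl, hcyl;             /* LBA 7:0, 15:8, 23:16 */
    uint8_t hob_sector, hob_lcyl, hob_hcyl; /* LBA 31:24, 39:32, 47:40 */
    uint8_t select;                         /* Device / drive select */
    uint16_t count;                         /* Count Low and Count High */
};

/* Copy the Register H2D FIS bytes into registers without interpreting them */
static void decompose_h2d_fis(const uint8_t fis[20], struct sketch_regs *r)
{
    r->feature     = fis[3];
    r->sector      = fis[4];
    r->lcyl        = fis[5];
    r->hcyl        = fis[6];
    r->select      = fis[7];
    r->hob_sector  = fis[8];
    r->hob_lcyl    = fis[9];
    r->hob_hcyl    = fis[10];
    r->hob_feature = fis[11];
    r->count       = (uint16_t)(fis[12] | (fis[13] << 8));
}

/* Reassemble a 48-bit LBA from the legacy registers */
static uint64_t lba48_of(const struct sketch_regs *r)
{
    return (uint64_t)r->sector |
           ((uint64_t)r->lcyl << 8) |
           ((uint64_t)r->hcyl << 16) |
           ((uint64_t)r->hob_sector << 24) |
           ((uint64_t)r->hob_lcyl << 32) |
           ((uint64_t)r->hob_hcyl << 40);
}

/* A count of 0 means 256 sectors for LBA28 and 65,536 for LBA48 */
static uint32_t effective_count(uint16_t count, int lba48)
{
    uint32_t c = lba48 ? count : (uint32_t)(count & 0xff);

    if (c == 0) {
        c = lba48 ? 65536u : 256u;
    }
    return c;
}
```

Keeping decomposition and interpretation separate mirrors the intent stated in (6): the FIS layer only moves bytes, and the device layer decides what they mean.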
John Snow [Mon, 3 Nov 2014 23:56:15 +0000 (18:56 -0500)]
ahci: add is_ncq predicate helper
A small helper to determine which S/ATA commands
are destined to be routed to the NCQ pathways.
This references SATA 3.2 section 13.6,
Native Command Queueing. See sections 13.6.4,
13.6.5, 13.6.6, 13.6.7 and 13.6.8 for all
SATA commands considered to be part of the
NCQ feature set. This is summarized in a small
list in section 13.6.3.1 and again in 13.6.3.2.
Not all of these NCQ commands are currently supported,
so the error pathways are adjusted slightly to be more
informative in the case they are encountered.
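A predicate of this kind can be sketched as a switch over the command opcode. The constant names are hypothetical and the opcode values are the ones I understand the ATA command set to assign to the NCQ commands; verify them against the specification before relying on them:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed opcodes for the NCQ feature set commands */
enum {
    SKETCH_READ_FPDMA_QUEUED    = 0x60,
    SKETCH_WRITE_FPDMA_QUEUED   = 0x61,
    SKETCH_NCQ_NON_DATA         = 0x63,
    SKETCH_SEND_FPDMA_QUEUED    = 0x64,
    SKETCH_RECEIVE_FPDMA_QUEUED = 0x65,
};

/* Returns true if the ATA command belongs to the NCQ feature set */
static bool sketch_is_ncq(uint8_t ata_cmd)
{
    switch (ata_cmd) {
    case SKETCH_READ_FPDMA_QUEUED:
    case SKETCH_WRITE_FPDMA_QUEUED:
    case SKETCH_NCQ_NON_DATA:
    case SKETCH_SEND_FPDMA_QUEUED:
    case SKETCH_RECEIVE_FPDMA_QUEUED:
        return true;
    default:
        return false;
    }
}
```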
John Snow [Fri, 31 Oct 2014 20:03:39 +0000 (16:03 -0400)]
ide: Correct handling of malformed/short PRDTs
This impacts both BMDMA and AHCI HBA interfaces for IDE.
Currently, we confuse the difference between a PRDT having
"0 bytes" and a PRDT having "0 complete sectors."
When we receive an incomplete sector, inconsistent error checking
leads to an infinite loop wherein the call succeeds, but it
didn't give us enough bytes -- leading us to re-call the
DMA chain over and over again. This leads to, in the BMDMA case,
leaked memory for short PRDTs, and infinite loops and resource
usage in the AHCI case.
The .prepare_buf() callback is reworked to return the number of
bytes that it successfully prepared. 0 is a valid, non-error
answer that means the table was empty and described no bytes.
-1 indicates an error.
Our current implementation uses the io_buffer in IDEState to
ultimately describe the size of a prepared scatter-gather list.
Even though the AHCI PRDT/SGList can be as large as 256GiB, the
AHCI command header limits transactions to just 4GiB. ATA8-ACS3,
however, defines the largest transaction to be an LBA48 command
that transfers 65,536 sectors. With a 512 byte sector size, this
is just 32MiB.
Since our current state structures use the int type to describe
the size of the buffer, and this state is migrated as int32, we
are limited to describing 2GiB buffer sizes unless we change the
migration protocol.
For this reason, this patch begins to unify the assertions in the
IDE pathways that the scatter-gather list provided by either the
AHCI PRDT or the PCI BMDMA PRDs can only describe, at a maximum,
2GiB. This should be resilient enough unless we need a sector
size that exceeds 32KiB.
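The arithmetic behind the 32MiB and 32KiB figures can be checked with
a few lines of C. The helper and macro names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LBA48_SECTORS 65536ULL   /* largest LBA48 transfer, in sectors */
#define KiB 1024ULL
#define MiB (1024ULL * KiB)
#define GiB (1024ULL * MiB)

/* Largest single transfer, in bytes, for a given sector size. */
static uint64_t max_transfer_bytes(uint64_t sector_size)
{
    return MAX_LBA48_SECTORS * sector_size;
}

/* Spot-check the two figures quoted in the text. */
static void check_transfer_limits(void)
{
    /* 512-byte sectors: 65,536 * 512 bytes = 32MiB. */
    assert(max_transfer_bytes(512) == 32 * MiB);
    /* 32KiB sectors: 65,536 * 32KiB = 2GiB, the bound discussed above. */
    assert(max_transfer_bytes(32 * KiB) == 2 * GiB);
}
```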
Further, the likelihood of any guest operating system actually
attempting to transfer this much data in a single operation is
very slim.
To this end, the IDEState variables have been updated to more
explicitly clarify our maximum supported size. Callers to the
prepare_buf callback have been reworked to understand the new
return code, and all versions of the prepare_buf callback have
been adjusted accordingly.
Lastly, the ahci_populate_sglist helper, relied upon by the
AHCI implementation of .prepare_buf(), and the PCI
implementation of the callback have had overflow assertions
added to help make clear the reasoning behind the various
type changes.
[Added %d -> %"PRId64" fix John sent because off_pos changed from int to
int64_t.
--Stefan]
John Snow [Fri, 31 Oct 2014 20:03:38 +0000 (16:03 -0400)]
ahci: unify sglist preparation
The intent of this patch is to further unify the creation and
deletion of the sglist used for all AHCI transfers, including
emulated PIO, ATAPI R/W, and native DMA R/W.
By replacing ahci_start_transfer's call to ahci_populate_sglist
with ahci_dma_prepare_buf, we reduce the number of direct calls
where we manipulate the scatter-gather list in the AHCI code.
To make this switch, the constant "0" passed as an offset
in ahci_dma_prepare_buf is adjusted to use io_buffer_offset.
For DMA pathways, this has no effect: io_buffer_offset is always
updated to 0 at the beginning of a DMA transfer loop regardless.
DMA pathways through ide_dma_cb() update the io_buffer_offset
accordingly, and for circumstances where we might make several
trips through this loop, this may actually correct a design flaw.
For PIO pathways, the newly updated ahci_dma_prepare_buf will
now prepare the sglist at the correct offset. It will also set
io_buffer_size, but this is not used in the cmd_read_pio or
cmd_write_pio pathways.
John Snow [Fri, 31 Oct 2014 20:03:37 +0000 (16:03 -0400)]
ide: repair PIO transfers for cases where nsector > 1
Currently, for emulated PIO transfers through the AHCI device,
any attempt made to request more than a single sector's worth
of data will result in the same sector being transferred over
and over.
For example, if we request 8 sectors via PIO READ SECTORS, the
AHCI device will give us the same sector eight times.
This patch adds offset tracking into the PIO pathways so that
we can fulfill these requests appropriately.
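The idea of offset tracking can be sketched as follows. This is a toy
model, not the AHCI code: ExamplePIOState and example_pio_read are
hypothetical, and the backing store is a flat byte array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512

/* Hypothetical PIO state: a flat backing store plus a running offset. */
typedef struct {
    const uint8_t *disk;
    size_t disk_size;
    size_t io_buffer_offset;
} ExamplePIOState;

/* Copy nsector sectors to dst, advancing the offset after each one. */
static void example_pio_read(ExamplePIOState *s, uint8_t *dst, size_t nsector)
{
    for (size_t i = 0; i < nsector; i++) {
        assert(s->io_buffer_offset + SECTOR_SIZE <= s->disk_size);
        memcpy(dst + i * SECTOR_SIZE, s->disk + s->io_buffer_offset,
               SECTOR_SIZE);
        /* Without this update, every pass would copy the same sector. */
        s->io_buffer_offset += SECTOR_SIZE;
    }
}
```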
John Snow [Tue, 11 Nov 2014 00:41:40 +0000 (19:41 -0500)]
ahci: Fix byte count regression for ATAPI/PIO
This patch fixes a regression caused by commit 659142ecf71a0da240ab0ff7cf929ee25c32b9bc.
The problem occurs when we wish to return early
from the ahci_start_transfer function, but are now
updating the transferred byte count in the AHCI
command header via ahci_commit_buf.
This will cause problems in the Windows 8 installer.
Don't update the byte count in the command header
for the transmission of ATAPI packets: these commands
will distort the final byte count of the actual data
payload.
The call to ahci_commit_buf remains in the "out"
portion of the call in order to clean up the sglist.
The byte count is maintained by forcing size to be 0.
Peter Maydell [Thu, 13 Nov 2014 15:44:16 +0000 (15:44 +0000)]
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
x86 and SCSI fixes. I left out the APIC device model
patches, pending confirmation from the submitter that they really
fix QNX.
# gpg: Signature made Thu 13 Nov 2014 15:13:38 GMT using RSA key ID 78C7AE83
# gpg: Good signature from "Paolo Bonzini <[email protected]>"
# gpg: aka "Paolo Bonzini <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 46F5 9FBD 57D6 12E7 BFD4 E2F7 7E15 100C CD36 69B1
# Subkey fingerprint: F133 3857 4B66 2389 866C 7682 BFFB D25F 78C7 AE83
* remotes/bonzini/tags/for-upstream:
acpi: accurate overflow check
smbios: change 'ram_addr_t' variables to 'uint64_t'
kvmclock: Add comment explaining why we need cpu_clean_all_dirty()
target-i386: fix Coverity complaints about overflows
apic_common: migrate missing fields
target-i386: eliminate dead code and hoist common code out of "if"
virtio-scsi: Fix comment for VirtIOSCSIReq
virtio-scsi: dataplane: suppress guest notification
esp: Do not overwrite ESP_TCHI after reset
virtio-scsi: dataplane: fix allocation for 'cmd_vrings'
esp: fix coding standards
virtio-scsi: work around bug in old BIOSes
esp-pci: fixup deadlock with linux
SeokYeon Hwang [Wed, 5 Nov 2014 06:19:54 +0000 (15:19 +0900)]
smbios: change 'ram_addr_t' variables to 'uint64_t'
ram_addr_t should not be used except when referring to a RAMBlock.
Using 'uint64_t' avoids a -Wconstant-conversion warning, which
clang >= 3.4 produces in "smbios_get_tables()".
Pavel Dovgalyuk [Thu, 28 Aug 2014 11:18:57 +0000 (15:18 +0400)]
apic_common: migrate missing fields
This patch adds the missing sipi_vector and wait_for_sipi fields to a new
subsection of the vmstate of the apic_common module. Saving and loading
of these fields makes migration of the apic state deterministic.
Signed-off-by: Pavel Dovgalyuk <[email protected]>
[Initialize the field in pre_load and kvm_apic_realize. - Paolo]
Signed-off-by: Paolo Bonzini <[email protected]>