virtio-net: support RSC v4/v6 tcp traffic for Windows HCK
This commit adds implementation of RX packets
coalescing, compatible with requirements of Windows
Hardware compatibility kit.
The device enables feature VIRTIO_NET_F_RSC_EXT in
host features if it supports extended RSC functionality
as defined in the specification.
This feature requires at least one of VIRTIO_NET_F_GUEST_TSO4,
VIRTIO_NET_F_GUEST_TSO6. Windows guest driver acks
this feature only if VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
is also present.
If the guest driver acks the VIRTIO_NET_F_RSC_EXT feature,
the device coalesces TCPv4 and TCPv6 packets (if the
respective VIRTIO_NET_F_GUEST_TSO feature is on),
populates extended RSC information in the virtio header
and sets the VIRTIO_NET_HDR_F_RSC_INFO bit in the header flags.
The device does not recalculate checksums in the coalesced
packet, so they are not valid.
In this case:
All the data packets of a TCP connection are cached in a single
buffer during each receive interval and are sent out by a timer.
'virtio_net_rsc_timeout' controls the interval; this value may affect
the performance and response time of the TCP connection.
50000 (50us) is an empirical value that gives a performance
improvement. Since the WHQL test sends packets every 100us,
'300000' (300us) passes the test case and is also the default
value. Tune it via the command line parameter 'rsc_interval'
of the 'virtio-net-pci' device, for example to launch a guest
with the interval set to '500000'.
The timer will only be triggered if the packets pool is not empty,
and it'll drain off all the cached packets.
'NetRscChain' is used to save the segments of IPv4/6 in a
VirtIONet device.
A new segment becomes a 'Candidate' as soon as it passes the sanity check;
the main TCP handler then covers TCP window updates, the duplicated
ACK check and the actual data coalescing.
A 'Candidate' segment means:
1. Segment is within current window and the sequence is the expected one.
2. 'ACK' of the segment is in the valid window.
The sanity check rejects:
1. An incorrect version in the IP header
2. A packet with IP options or an IP fragment
3. A packet that is not TCP
4. A packet that fails the size check (to prevent buffer overflow attacks)
5. An ECN packet
More cases might deserve consideration, such as the IP identification
and other flags, but checking them breaks the test because
Windows sets the identification to the same value even when the packet
is not a fragment.
There are normally two ways to handle a TCP control flag,
'bypass' and 'finalize'. 'bypass' means the packet is sent out directly,
while 'finalize' means the packet is also bypassed, but only
after searching the pool for packets of the same connection
and draining all of them, to avoid out-of-order delivery.
All 'SYN' packets are bypassed since they always begin a new
connection. Other flags such as 'URG/FIN/RST/CWR/ECE' trigger a
finalization, because they normally appear when a connection is about
to be closed; a 'URG' packet also finalizes the current coalescing unit.
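The flag handling described above can be sketched roughly as follows. This is only an illustration: the function name, the return strings and the constant names are hypothetical and are not taken from the QEMU sources; only the standard TCP flag bit positions are relied upon.

```python
# Hypothetical sketch of the 'bypass' / 'finalize' decision; names are
# illustrative only and do not come from QEMU.
# Standard TCP flag bits, lowest bit first.
FIN, SYN, RST, PSH, ACK, URG, ECE, CWR = (1 << i for i in range(8))

# Flags that force cached segments of the connection to be drained first.
FINALIZE_FLAGS = URG | FIN | RST | CWR | ECE


def classify(tcp_flags):
    """Return 'bypass', 'finalize' or 'candidate' for a TCP flags byte."""
    if tcp_flags & SYN:
        # A SYN always starts a new connection: deliver it directly.
        return "bypass"
    if tcp_flags & FINALIZE_FLAGS:
        # Drain cached segments of this connection, then deliver the packet.
        return "finalize"
    # Plain data/ACK segment: eligible for coalescing.
    return "candidate"
```

Checking SYN before the other flags keeps a SYN that also carries ECE/CWR on the 'bypass' path, in line with "all 'SYN' packets are bypassed".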
Statistics can be used to monitor the basic coalescing status; the
'out of order' and 'out of window' counters show how many packets were
retransmitted, and thus describe the performance intuitively.
Difference between IPv4 and IPv6 processing:
The length field in the IPv4 header includes the header itself, while the
IPv6 one does not, which means IPv6 can carry a real 65535-byte payload.
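A small worked example of that length difference, as a sketch only. The helper names are hypothetical, and the IPv4 case assumes a header without options (IHL of 5 words), which matches the sanity check that rejects IP options.

```python
def ipv4_payload_len(total_length, ihl_words=5):
    # The IPv4 'total length' field counts the IP header itself,
    # so the header (ihl_words * 4 bytes) has to be subtracted.
    return total_length - ihl_words * 4


def ipv6_payload_len(payload_length):
    # The IPv6 'payload length' field excludes the fixed 40-byte header.
    return payload_length
```

With both 16-bit fields at their maximum of 65535, IPv4 leaves 65515 bytes after the header while IPv6 carries the full 65535.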
Note that the main goal of implementing this feature in software
is to create a reference setup for certification tests. In such
setups guest migration is not required, so the coalesced packets
not yet delivered to the guest will be lost in case of migration.
Igor Mammedov [Thu, 27 Dec 2018 14:13:34 +0000 (15:13 +0100)]
tests: acpi: use AcpiSdtTable::aml instead of AcpiSdtTable::header::signature
AcpiSdtTable::header::signature is the only remaining field of the
AcpiTableHeader structure used by tests. Instead of using a packed
structure to access the signature, just read it directly from the table
blob, remove the no longer used AcpiSdtTable::header / union and
keep only the AcpiSdtTable::aml byte array.
Igor Mammedov [Thu, 27 Dec 2018 14:13:33 +0000 (15:13 +0100)]
tests: acpi: squash sanitize_fadt_ptrs() into test_acpi_fadt_table()
Some parts of sanitize_fadt_ptrs() do redundant work:
- locating FADT
- checking original checksum
There is no need for that as test_acpi_fadt_table() already does it,
so drop the duplicate code and move the remaining fixup code into
test_acpi_fadt_table().
Igor Mammedov [Thu, 27 Dec 2018 14:13:32 +0000 (15:13 +0100)]
tests: smbios: fetch whole table in one step instead of reading it step by step
Replace a bunch of ACPI_READ_ARRAY/ACPI_READ_FIELD macros, which read
the SMBIOS table field by field, with one memread() that fetches the whole
table at once, and drop the no longer used ACPI_READ_ARRAY/ACPI_READ_FIELD
macros.
Igor Mammedov [Thu, 27 Dec 2018 14:13:31 +0000 (15:13 +0100)]
tests: acpi: reuse fetch_table() in vmgenid-test
Move fetch_table() into acpi-utils.c, renaming it to acpi_fetch_table(),
and reuse it in vmgenid-test, which reads RSDT and then the tables it
references to find and parse the VMGENID SSDT.
While at it, wrap the enumeration of RSDT referenced tables into a FOREACH
macro (similar to what we do with QLIST_FOREACH & co) to reuse it with the
bios and vmgenid tests.
Igor Mammedov [Thu, 27 Dec 2018 14:13:30 +0000 (15:13 +0100)]
tests: acpi: reuse fetch_table() for fetching FACS and DSDT
It allows removing a bit more code duplication and
reusing the common utility to get ACPI tables from the guest (modulo RSDP).
While at it, consolidate signature checking into fetch_table() instead
of open-coding it.
Considering FACS is special and doesn't have a checksum, make checksum
validation opt-in; the same goes for signature verification.
PS:
By pure accident, the patch also fixes FACS not being tested against
the reference table since it wasn't added to the data::tables list.
But we managed not to regress it since the reference file was added
by commit (d25979380 acpi unit-test: add test files) back in 2013.
Igor Mammedov [Thu, 27 Dec 2018 14:13:29 +0000 (15:13 +0100)]
tests: acpi: simplify rsdt handling
RSDT referenced tables always have length at offset 4 and checksum at
offset 9, that's enough for reusing fetch_table() and replacing custom
RSDT fetching code with it.
While at it
* merge fetch_rsdt_referenced_tables() into test_acpi_rsdt_table()
* drop test_data::rsdt_table/rsdt_tables_addr/rsdt_tables_nr since
we need this data only for duration of test_acpi_rsdt_table() to
fetch other tables and use locals instead.
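The two offsets mentioned above (length at offset 4, checksum at offset 9) are enough to validate a table blob generically. The following is a minimal sketch in Python rather than the test suite's C; the helper names are hypothetical, and it assumes the standard 36-byte ACPI table header and the usual rule that all bytes of a table sum to zero modulo 256.

```python
import struct

ACPI_HEADER_LEN = 36  # standard ACPI system description table header


def table_length(blob):
    # 32-bit little-endian table length stored at offset 4.
    return struct.unpack_from("<I", blob, 4)[0]


def checksum_ok(blob):
    # A table is valid when all of its bytes sum to 0 modulo 256.
    return (sum(blob[:table_length(blob)]) & 0xFF) == 0


def make_table(signature, body):
    # Build a table with a correct length and checksum (for demonstration).
    table = bytearray(ACPI_HEADER_LEN) + body
    table[0:4] = signature
    struct.pack_into("<I", table, 4, len(table))
    table[9] = (-sum(table)) & 0xFF  # checksum byte lives at offset 9
    return bytes(table)
```

A caller that knows only the guest address of a table can read the first 8 bytes, take the length from them, and then fetch the whole blob in one go before running the checksum.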
Igor Mammedov [Thu, 27 Dec 2018 14:13:28 +0000 (15:13 +0100)]
tests: acpi: make sure FADT is fetched only once
Whole FADT is fetched as part of RSDT referenced tables in
fetch_rsdt_referenced_tables() albeit a bit later than when FADT
is partially parsed in fadt_fetch_facs_and_dsdt_ptrs().
However there is no reason for calling fetch_rsdt_referenced_tables()
so late, just move it right after we fetched RSDT and before
fadt_fetch_facs_and_dsdt_ptrs(). That way we can reuse whole FADT
fetched by fetch_rsdt_referenced_tables() and avoid duplicate
custom fields fetching in fadt_fetch_facs_and_dsdt_ptrs().
While at it rename fadt_fetch_facs_and_dsdt_ptrs() to
test_acpi_fadt_table(). The follow up patch will merge
fadt_fetch_facs_and_dsdt_ptrs() into test_acpi_rsdt_table(),
so that we would end up calling only test_acpi_FOO_table()
for consistency for tables that require special processing.
Igor Mammedov [Thu, 27 Dec 2018 14:13:27 +0000 (15:13 +0100)]
tests: acpi: use AcpiSdtTable::aml in consistent way
Currently in the 1st case we store the table body fetched from QEMU in
AcpiSdtTable::aml minus its header, but in the 2nd case, when we
load the reference aml from disk, it holds the whole blob including the header.
Moreover, in the 1st case we read the header into a separate
AcpiSdtTable::header structure and then jump through hoops to fix up tables
and combine both.
Treat AcpiSdtTable::aml as the whole table blob in both cases
and, when fetching tables from QEMU, first get the table length and then
fetch the whole table into AcpiSdtTable::aml instead of doing it field
by field.
As result
* AcpiSdtTable::aml is used in consistent manner
* FADT fixups use offsets from spec instead of being shifted by
header length
* calculating checksums and dumping blobs becomes simpler
Li Qiang [Sat, 15 Dec 2018 12:03:52 +0000 (04:03 -0800)]
vhost-user: fix ioeventfd_enabled
Currently, vhost-user-test assumes that eventfd is available.
However, that is not true because the accel is qtest, so
'vhost_set_vring_file' will not add fds to the msg and the server
side of vhost-user-test will be broken. The bug is in 'ioeventfd_enabled':
this function should return true when the kvm accel is not in use.
Jian Wang [Sat, 22 Dec 2018 10:27:28 +0000 (18:27 +0800)]
qemu: avoid memory leak while remove disk
The vhost_dev_cleanup function memsets vhost_dev to zero.
This sets dev.vqs to NULL, so the later g_free call on vqs
frees nothing, which results in a memory leak.
However, vqs cannot be released directly in the vhost_dev_cleanup
function, because vhost_net also calls this function and vhost_net's
vqs is assigned from an array.
To solve this problem, we first save the vqs pointer
and release the vqs memory after vhost_dev_cleanup has been called.
It's been marked as deprecated in QEMU v2.6.0 already, so really nobody
should use the legacy "ivshmem" device anymore (but use ivshmem-plain or
ivshmem-doorbell instead). Time to remove the deprecated device now.
Belatedly also update a mention of the deprecated "ivshmem" in the file
docs/specs/ivshmem-spec.txt to "ivshmem-doorbell". Missed in commit 5400c02b90b ("ivshmem: Split ivshmem-plain, ivshmem-doorbell off ivshmem").
Dongli Zhang [Sun, 16 Dec 2018 23:34:39 +0000 (07:34 +0800)]
msix: make pba size math more uniform
In msix_exclusive_bar the bar_pba_size is more than what the pba is
expected to have, although this never affects the bar size.
Specifically, the math in msix_init_exclusive_bar allocates too much
memory in some cases.
For example consider nentries = 8. msix_exclusive_bar will give us
bar_pba_size = 16. So 16 bytes. However 8 bytes would be enough - this
is all that the spec requires.
So in practice bar_pba_size sometimes allocates an extra 8 bytes but
never more.
Since each MSIX entry size is 16 bytes, and since we make sure that
table+pba is a power of two, this always leaves a multiple of 16 bytes
for the PBA, so extra 8 bytes have no effect.
However, it's ugly for the temporary PBA size variable to hold an incorrect
value. For consistency switch to the formula used in msix_init.
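As a worked example of the sizing the message refers to, here is a sketch assuming one pending bit per vector, rounded up to whole 64-bit words. The function name is hypothetical and this is not the QEMU code itself; it only reproduces the "8 bytes would be enough" figure for nentries = 8.

```python
def pba_size_bytes(nentries):
    # One pending bit per MSI-X vector, rounded up to whole 64-bit
    # words, expressed in bytes (assumed spec-minimum sizing).
    qwords = (nentries + 63) // 64
    return qwords * 8
```

For nentries = 8 this gives 8 bytes, compared with the 16 bytes the message reports for the previous calculation.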
We better stop right away. For now, errors would be partially ignored
(so the guest might get informed or the device might get unplugged),
although actual plug/unplug will be reported as failed to the user.
While at it, properly move the check to the pre_plug handler for the plug
case, as we can test the slot state before the device will be realized.
Peter Maydell [Mon, 14 Jan 2019 17:35:00 +0000 (17:35 +0000)]
Merge remote-tracking branch 'remotes/ehabkost/tags/x86-next-pull-request' into staging
x86 queue, 2019-01-14
* Reenable RDTSCP support on Opteron_G[345] CPU models
(Borislav Petkov)
* host-phys-bits-limit option for better control of 5-level EPT
(Eduardo Habkost)
* Disable MPX support on named CPU models (Paolo Bonzini)
* expose HV_CPUID_ENLIGHTMENT_INFO.EAX and HV_CPUID_NESTED_FEATURES.EAX
as feature words (Vitaly Kuznetsov)
# gpg: Signature made Mon 14 Jan 2019 14:33:55 GMT
# gpg: using RSA key 2807936F984DC5A6
# gpg: Good signature from "Eduardo Habkost <[email protected]>"
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6
* remotes/ehabkost/tags/x86-next-pull-request:
i386/kvm: add a comment explaining why .feat_names are commented out for Hyper-V feature bits
x86: host-phys-bits-limit option
target/i386: Disable MPX support on named CPU models
target-i386: Reenable RDTSCP support on Opteron_G[345] CPU models CPU models
i386/kvm: expose HV_CPUID_ENLIGHTMENT_INFO.EAX and HV_CPUID_NESTED_FEATURES.EAX as feature words
Eduardo Habkost [Tue, 11 Dec 2018 19:25:27 +0000 (17:25 -0200)]
x86: host-phys-bits-limit option
Some downstream distributions of QEMU set host-phys-bits=on by
default. This worked very well for most use cases, because
phys-bits really didn't have huge consequences. The only
difference was on the CPUID data seen by guests, and on the
handling of reserved bits.
This changed in KVM commit 855feb673640 ("KVM: MMU: Add 5 level
EPT & Shadow page table support"). Now choosing a large
phys-bits value for a VM has bigger impact: it will make KVM use
5-level EPT even when it's not really necessary. This means
using the host phys-bits value may not be the best choice.
Management software could address this problem by manually
configuring phys-bits depending on the size of the VM and the
amount of MMIO address space required for hotplug. But this is
not trivial to implement.
However, there's another workaround that would work for most
cases: keep using the host phys-bits value, but only if it's
smaller than 48. This patch makes this possible by introducing a
new "-cpu" option: "host-phys-bits-limit". Management software
or users can make sure they will always use 4-level EPT using:
"host-phys-bits=on,host-phys-bits-limit=48".
This behavior is still not enabled by default because QEMU
doesn't enable host-phys-bits=on by default. But users,
management software, or downstream distributions may choose to
change their defaults using the new option.
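The effect of the new option can be sketched as a simple clamp. This is an illustration only: the function name is hypothetical, and treating a limit of 0 as "no limit" is an assumption, not something stated in the message.

```python
def effective_phys_bits(host_phys_bits, limit=0):
    # limit == 0 is assumed here to mean "no limit configured".
    if limit and host_phys_bits > limit:
        # Host reports more bits than allowed: clamp to the limit.
        return limit
    # Otherwise keep using the host value unchanged.
    return host_phys_bits
```

With "host-phys-bits=on,host-phys-bits-limit=48", a host reporting 52 bits would yield 48 for the guest, while a host reporting 46 bits would keep 46.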
Paolo Bonzini [Thu, 20 Dec 2018 12:11:00 +0000 (13:11 +0100)]
target/i386: Disable MPX support on named CPU models
MPX support is being phased out by Intel; GCC has dropped it, Linux
is also going to do that. Even though KVM will have special code
to support MPX after the kernel proper stops enabling it in XCR0,
we probably also want to deprecate that in a few years. As a start,
do not enable it by default for any named CPU model starting with
the 4.0 machine types; this includes Skylake, Icelake and Cascadelake.
Opteron_G2 - being family 15, model 6, doesn't have RDTSCP support
(the real hardware doesn't have it. K8 got RDTSCP support with the NPT
models, i.e., models >= 0x40).
Document the host's minimum required kernel version, while at it.
Vitaly Kuznetsov [Mon, 26 Nov 2018 13:59:58 +0000 (14:59 +0100)]
i386/kvm: expose HV_CPUID_ENLIGHTMENT_INFO.EAX and HV_CPUID_NESTED_FEATURES.EAX as feature words
It was found that QMP users of QEMU (e.g. libvirt) may need
HV_CPUID_ENLIGHTMENT_INFO.EAX/HV_CPUID_NESTED_FEATURES.EAX information. In
particular, 'hv_tlbflush' and 'hv_evmcs' enlightenments are only exposed in
HV_CPUID_ENLIGHTMENT_INFO.EAX.
HV_CPUID_NESTED_FEATURES.EAX is exposed for two reasons: convenience
(we don't need to export it from hyperv_handle_properties()) and
future-proofing for the Enlightened MSR-Bitmap, PV EPT invalidation and
direct virtual flush features.
Peter Maydell [Mon, 14 Jan 2019 13:54:17 +0000 (13:54 +0000)]
Merge remote-tracking branch 'remotes/aperard/tags/pull-xen-20190114' into staging
Xen queue
* Xen PV backend 'qdevification'.
Starting with xen_disk.
* Performance improvements for xen-block.
* Removal of the Xen PV domain builder.
* bug fixes.
# gpg: Signature made Mon 14 Jan 2019 13:46:33 GMT
# gpg: using RSA key 0CF5572FD7FB55AF
# gpg: Good signature from "Anthony PERARD <[email protected]>"
# gpg: aka "Anthony PERARD <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 5379 2F71 024C 600F 778A 7161 D8D5 7199 DF83 42C8
# Subkey fingerprint: F80C 0063 08E2 2CFD 8A92 E798 0CF5 572F D7FB 55AF
* remotes/aperard/tags/pull-xen-20190114: (25 commits)
xen-block: avoid repeated memory allocation
xen-block: improve response latency
xen-block: improve batching behaviour
xen: Replace few mentions of xend by libxl
Remove broken Xen PV domain builder
xen: remove the legacy 'xen_disk' backend
MAINTAINERS: add myself as a Xen maintainer
xen: automatically create XenBlockDevice-s
xen: add a mechanism to automatically create XenDevice-s...
xen: add implementations of xen-block connect and disconnect functions...
xen: purge 'blk' and 'ioreq' from function names in dataplane/xen-block.c
xen: remove 'ioreq' struct/varable/field names from dataplane/xen-block.c
xen: remove 'XenBlkDev' and 'blkdev' names from dataplane/xen-block
xen: add header and build dataplane/xen-block.c
xen: remove unnecessary code from dataplane/xen-block.c
xen: duplicate xen_disk.c as basis of dataplane/xen-block.c
xen: add event channel interface for XenDevice-s
xen: add grant table interface for XenDevice-s
xen: add xenstore watcher infrastructure
xen: create xenstore areas for XenDevice-s
...
Tim Smith [Wed, 12 Dec 2018 11:16:26 +0000 (11:16 +0000)]
xen-block: avoid repeated memory allocation
The xen-block dataplane currently allocates memory to hold the data for
each request as that request is used, and frees it afterwards. Because
it requires page-aligned blocks, this interacts poorly with non-page-
aligned allocations and balloons the heap.
Instead, allocate the maximum possible buffer size required for the
protocol, which is BLKIF_MAX_SEGMENTS_PER_REQUEST (currently 11) pages
when the request structure is created, and keep that buffer until it is
destroyed. Since the requests are re-used via a free list, this should
actually improve memory usage.
Signed-off-by: Tim Smith <[email protected]>
Re-based and commit comment adjusted.
Tim Smith [Wed, 12 Dec 2018 11:16:25 +0000 (11:16 +0000)]
xen-block: improve response latency
If the I/O ring is full, the guest cannot send any more requests
until some responses are sent. Only sending all available responses
just before checking for new work does not leave much time for the
guest to supply new work, so this will cause stalls if the ring gets
full. Also, not completing reads as soon as possible adds latency
to the guest.
To alleviate that, complete IO requests as soon as they come back.
xen_block_send_response() already returns a value indicating whether
a notify should be sent, which is all the batching we need.
Signed-off-by: Tim Smith <[email protected]>
Re-based and commit comment adjusted.
Tim Smith [Wed, 12 Dec 2018 11:16:24 +0000 (11:16 +0000)]
xen-block: improve batching behaviour
When I/O consists of many small requests, performance is improved by
batching them together in a single io_submit() call. When there are
relatively few requests, the extra overhead is not worth it. This
introduces a check to start batching I/O requests via blk_io_plug()/
blk_io_unplug() in an amount proportional to the number which were
already in flight at the time we started reading the ring.
Signed-off-by: Tim Smith <[email protected]>
Re-based and commit comment adjusted.
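The batching rule described in this message can be sketched as follows. This is a simplified model, not the dataplane code: the function name, the event list and the threshold value are hypothetical, and "plug"/"unplug" merely stand in for the blk_io_plug()/blk_io_unplug() calls named above.

```python
def process_ring(requests, inflight_at_start, threshold=1):
    # Returns the sequence of actions taken, for illustration only.
    events = []
    # Batch only when enough requests were already in flight.
    batching = inflight_at_start > threshold
    batched = 0
    if batching:
        events.append("plug")
    for req in requests:
        events.append(("submit", req))
        if batching:
            batched += 1
            if batched >= inflight_at_start:
                # Flush a batch proportional to the in-flight count,
                # then start collecting the next one.
                events.append("unplug")
                events.append("plug")
                batched = 0
    if batching:
        events.append("unplug")
    return events
```

With few requests in flight the loop submits each request immediately; with many in flight, submissions are grouped into batches whose size tracks the in-flight count seen when ring processing started.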
Paul Durrant [Tue, 8 Jan 2019 14:49:02 +0000 (14:49 +0000)]
MAINTAINERS: add myself as a Xen maintainer
I have made many significant contributions to the Xen code in QEMU,
particularly the recent patches introducing a new PV device framework.
I intend to make further significant contributions, porting other PV back-
ends to the new framework with the intent of eventually removing the
legacy code. It therefore seems reasonable that I become a maintainer of
the Xen code.
Paul Durrant [Tue, 8 Jan 2019 14:49:01 +0000 (14:49 +0000)]
xen: automatically create XenBlockDevice-s
This patch adds create and destroy functions for XenBlockDevice-s so that
they can be created automatically when the Xen toolstack instantiates a new
PV backend via xenstore. When the XenBlockDevice is created this way it is
also necessary to create a 'drive' which matches the configuration that the
Xen toolstack has written into xenstore. This is done by formulating the
parameters necessary for each 'blockdev' layer of the drive and then using
qmp_blockdev_add() to create the layers. Also, for compatibility with the
legacy 'xen_disk' implementation, an iothread is automatically created for
the new XenBlockDevice. This, like the driver layers, will be destroyed
after the XenBlockDevice is unrealized.
The legacy backend scan for 'qdisk' is removed by this patch, which makes
the 'xen_disk' code redundant. The code will be removed by a subsequent
patch.
Paul Durrant [Tue, 8 Jan 2019 14:49:00 +0000 (14:49 +0000)]
xen: add a mechanism to automatically create XenDevice-s...
...that maintains compatibility with existing Xen toolstacks.
Xen toolstacks instantiate PV backends by simply writing information into
xenstore and expecting a backend implementation to be watching for this.
This patch adds a new 'xen-backend' module to allow individual XenDevice
implementations to register create and destroy functions. The creator
will be called when a tool-stack instantiates a new backend in this way,
and the destructor will then be called after the resulting XenDevice
object is unrealized.
To support this it is also necessary to add new watchers into the XenBus
implementation to handle enumeration of new backends and also destruction
of XenDevice-s when the toolstack sets the backend 'online' key to 0.
NOTE: This patch only adds the framework. A subsequent patch will add a
creator function for xen-block devices.
Paul Durrant [Tue, 8 Jan 2019 14:48:59 +0000 (14:48 +0000)]
xen: add implementations of xen-block connect and disconnect functions...
...and wire in the dataplane.
This patch adds the remaining code to make the xen-block XenDevice
functional. The parameters that a block frontend expects to find are
populated in the backend xenstore area, and the 'ring-ref' and
'event-channel' values specified in the frontend xenstore area are
mapped/bound and used to set up the dataplane.
Paul Durrant [Tue, 8 Jan 2019 14:48:58 +0000 (14:48 +0000)]
xen: purge 'blk' and 'ioreq' from function names in dataplane/xen-block.c
This is a purely cosmetic patch that purges remaining use of 'blk' and
'ioreq' in local function names, and then makes sure all functions are
prefixed with 'xen_block_'.
Paul Durrant [Tue, 8 Jan 2019 14:48:57 +0000 (14:48 +0000)]
xen: remove 'ioreq' struct/varable/field names from dataplane/xen-block.c
This is a purely cosmetic patch that purges the name 'ioreq' from struct,
variable and field names. (This name has been problematic for a long time
as 'ioreq' is the name used for generic I/O requests coming from Xen).
The patch replaces 'struct ioreq' with a new 'XenBlockRequest' type and
'ioreq' field/variable names with 'request', and then does necessary
fix-up to adhere to coding style.
Function names are not modified by this patch. They will be dealt with in
a subsequent patch.
Paul Durrant [Tue, 8 Jan 2019 14:48:56 +0000 (14:48 +0000)]
xen: remove 'XenBlkDev' and 'blkdev' names from dataplane/xen-block
This is a purely cosmetic patch that substitutes the old 'struct XenBlkDev'
name with 'XenBlockDataPlane' and 'blkdev' field/variable names with
'dataplane', and then does necessary fix-up to adhere to coding style.
Paul Durrant [Tue, 8 Jan 2019 14:48:55 +0000 (14:48 +0000)]
xen: add header and build dataplane/xen-block.c
This patch adds the transformations necessary to get dataplane/xen-block.c
to build against the new XenBus/XenDevice framework. MAINTAINERS is also
updated due to the introduction of dataplane/xen-block.h.
NOTE: Existing data structure names are retained for the moment. These will
be modified by subsequent patches. A typedef for XenBlockDataPlane
has been added to the header (based on the old struct XenBlkDev name
for the moment) so that the old names don't need to leak out of the
dataplane code.
Paul Durrant [Tue, 8 Jan 2019 14:48:54 +0000 (14:48 +0000)]
xen: remove unnecessary code from dataplane/xen-block.c
Not all of the code duplicated from xen_disk.c is required as the basis for
the new dataplane implementation so this patch removes extraneous code,
along with the legacy #includes and calls to the legacy xen_pv_printf()
function. Error messages are changed to be reported using error_report().
NOTE: The code is still not yet built. Further transformations will be
required to make it correctly interface to the new XenBus/XenDevice
framework. They will be delivered in a subsequent patch.
Paul Durrant [Tue, 8 Jan 2019 14:48:53 +0000 (14:48 +0000)]
xen: duplicate xen_disk.c as basis of dataplane/xen-block.c
The new xen-block XenDevice implementation requires the same core
dataplane as the legacy xen_disk implementation it will eventually replace.
This patch therefore copies the legacy xen_disk.c source module into a new
dataplane/xen-block.c source module as the basis for the new dataplane and
adjusts the MAINTAINERS file accordingly.
NOTE: The duplicated code is not yet built. It is simply put into place by
this patch (just fixing style violations) such that the
modifications that will need to be made to the code are not
conflated with code movement, thus making review harder.
Paul Durrant [Tue, 8 Jan 2019 14:48:52 +0000 (14:48 +0000)]
xen: add event channel interface for XenDevice-s
The legacy PV backend infrastructure provides functions to bind, unbind
and send notifications to event channels. Similar functionality will be
required by XenDevice implementations so this patch adds the necessary
support.
Patch "xen: add event channel interface for XenDevice-s" makes use of
the type xenevtchn_port_or_error_t, but this isn't available before Xen
4.7. Also, the function xen_device_bind_event_channel assigns the return
value of xenevtchn_bind_interdomain to channel->local_port but checks the
result for error with xendev->local_port.
Fix by:
- removing local_port from struct XenDevice as it isn't used anywhere.
- adding a compatibility typedef for xenevtchn_port_or_error_t for Xen
4.6 and earlier.
As an extra, replace the type of XenEventChannel->local_port with
evtchn_port_t.
Paul Durrant [Tue, 8 Jan 2019 14:48:51 +0000 (14:48 +0000)]
xen: add grant table interface for XenDevice-s
The legacy PV backend infrastructure provides functions to map, unmap and
copy pages granted by frontends. Similar functionality will be required
by XenDevice implementations so this patch adds the necessary support.
Paul Durrant [Tue, 8 Jan 2019 14:48:50 +0000 (14:48 +0000)]
xen: add xenstore watcher infrastructure
A Xen PV frontend communicates its state to the PV backend by writing to
the 'state' key in the frontend area in xenstore. It is therefore
necessary for a XenDevice implementation to be notified whenever the
value of this key changes.
This patch adds code to do this as follows:
- an 'fd handler' is registered on the libxenstore handle which will be
triggered whenever a 'watch' event occurs
- primitives are added to xen-bus-helper to add or remove watch events
- a list of Notifier objects is added to XenBus to provide a mechanism
to call the appropriate 'watch handler' when its associated event
occurs
The xen-block implementation is extended with a 'frontend_changed' method,
which calls as-yet stub 'connect' and 'disconnect' functions when the
relevant frontend state transitions occur. A subsequent patch will supply
a full implementation for these functions.
Paul Durrant [Tue, 8 Jan 2019 14:48:49 +0000 (14:48 +0000)]
xen: create xenstore areas for XenDevice-s
This patch adds a new source module, xen-bus-helper.c, which builds on
basic libxenstore primitives to provide functions to create (setting
permissions appropriately) and destroy xenstore areas, and functions to
'printf' and 'scanf' nodes therein. The main xen-bus code then uses
these primitives [1] to initialize and destroy the frontend and backend
areas for a XenDevice during realize and unrealize respectively.
The 'xen-block' implementation is extended with a 'get_name' method that
returns the VBD number. This number is required to 'name' the xenstore
areas.
NOTE: An exit handler is also added to make sure the xenstore areas are
cleaned up if QEMU terminates without devices being unrealized.
[1] The 'scanf' functions are actually not yet needed, but they will be
needed by code delivered in subsequent patches.
Paul Durrant [Tue, 8 Jan 2019 14:48:48 +0000 (14:48 +0000)]
xen: introduce 'xen-block', 'xen-disk' and 'xen-cdrom'
This patch adds new XenDevice-s: 'xen-disk' and 'xen-cdrom', both derived
from a common 'xen-block' parent type. These will eventually replace the
'xen_disk' (note the underscore rather than hyphen) legacy PV backend but
it is illustrative to build up the implementation incrementally, along with
the XenBus/XenDevice framework. Subsequent patches will therefore add to
these devices' implementation as new features are added to the framework.
After this patch has been applied it is possible to instantiate new
'xen-disk' or 'xen-cdrom' devices with a single 'vdev' parameter, which
accepts values adhering to the Xen VBD naming scheme [1]. For example, a
command-line instantiation of a xen-disk can be done with an argument
similar to the following:
-device xen-disk,vdev=hda
The implementation of the vdev parameter formulates the appropriate VBD
number for use in the PV protocol.
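To illustrate what "formulating the VBD number" involves, here is a rough sketch based on the commonly documented Xen VBD numbering (hd, sd and xvd names). It is not the QEMU parser: the function name is hypothetical, and the major numbers and bit layouts are stated from the Xen VBD interface description as an assumption.

```python
import re


def vbd_number(vdev):
    """Map a vdev name such as 'hda' or 'xvdb1' to a VBD number."""
    match = re.fullmatch(r"(hd|sd|xvd)([a-z]+)(\d*)", vdev)
    if match is None:
        raise ValueError("unrecognised vdev: %r" % vdev)
    kind, letters, digits = match.groups()

    # 'a' -> 0, 'b' -> 1, ..., 'z' -> 25, 'aa' -> 26, ...
    disk = 0
    for ch in letters:
        disk = disk * 26 + (ord(ch) - ord("a") + 1)
    disk -= 1
    partition = int(digits) if digits else 0

    if kind == "xvd":
        if disk < 16 and partition < 16:
            return (202 << 8) | (disk << 4) | partition
        if partition >= 256:
            raise ValueError("partition out of range: %r" % vdev)
        # Extended scheme for disks or partitions from 16 onwards.
        return (1 << 28) | (disk << 8) | partition
    if kind == "sd":
        if disk >= 16 or partition >= 16:
            raise ValueError("sd name out of range: %r" % vdev)
        return (8 << 8) | (disk << 4) | partition
    # kind == "hd": hda/hdb use major 3, hdc/hdd use major 22.
    if disk >= 4 or partition >= 64:
        raise ValueError("hd name out of range: %r" % vdev)
    major = 3 if disk < 2 else 22
    return (major << 8) | ((disk % 2) << 6) | partition
```

Under these assumptions, the 'hda' value from the command line above maps to 768, and 'xvda' would map to 51712.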
Paul Durrant [Tue, 8 Jan 2019 14:48:47 +0000 (14:48 +0000)]
xen: introduce new 'XenBus' and 'XenDevice' object hierarchy
This patch adds the basic boilerplate for a 'XenBus' object that will act
as a parent to 'XenDevice' PV backends.
A new 'XenBridge' object is also added to connect XenBus to the system bus.
The XenBus object is instantiated by a new xen_bus_init() function called
from the same sites as the legacy xen_be_init() function.
Subsequent patches will flesh-out the functionality of these objects.
Paul Durrant [Tue, 8 Jan 2019 14:48:46 +0000 (14:48 +0000)]
xen: re-name XenDevice to XenLegacyDevice...
...and xen_backend.h to xen-legacy-backend.h
Rather than attempting to convert the existing backend infrastructure to
be QOM compliant (which would be hard to do in an incremental fashion),
subsequent patches will introduce a completely new framework for Xen PV
backends. Hence it is necessary to re-name parts of existing code to avoid
name clashes. The re-named 'legacy' infrastructure will be removed once all
backends have been ported to the new framework.
This patch is purely cosmetic. No functional change.
Zhao Yan [Wed, 5 Dec 2018 07:58:30 +0000 (02:58 -0500)]
xen/pt: allow passthrough of devices with bogus interrupt pin
For some PCI devices, even though PCI_INTERRUPT_PIN is not 0, the device
doesn't actually support INTx mode, so its machine irq read from host sysfs
is 0. In that case, report PCI_INTERRUPT_PIN as 0 to the guest and let
passthrough continue.
Peter Maydell [Mon, 19 Nov 2018 16:26:58 +0000 (16:26 +0000)]
hw/xen/xen_pt_graphics: Don't trust the BIOS ROM contents so much
Coverity (CID 796599) points out that xen_pt_setup_vga() trusts
the rom->size field in the BIOS ROM from a PCI passthrough VGA
device, and uses it as an index into the memory which contains
the BIOS image. A corrupt BIOS ROM could therefore cause us to
index off the end of the buffer.
Check that the size is within bounds before we use it.
We are also trusting the pcioffset field, and assuming that
the whole rom_header is present; Coverity doesn't notice these,
but check them too.
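The kind of validation described here can be sketched as below. This is an illustration in Python, not the xen_pt_setup_vga() fix: the function and constant names are hypothetical, and the layout (size byte at offset 2 counting 512-byte blocks, 16-bit pointer at offset 0x18) is the generic PCI expansion ROM header, assumed rather than taken from the patch.

```python
import struct

ROM_HEADER_LEN = 0x1A   # assumed: header up to and including the 0x18 pointer
PCIR_SIG_LEN = 4        # assumed: at least the data structure signature


def checked_rom_size(image):
    """Return the ROM size in bytes, or None if the header is not credible."""
    # The whole header must be present before any field is read.
    if len(image) < ROM_HEADER_LEN:
        return None
    # Size field counts 512-byte blocks; it must fit inside the buffer.
    size = image[2] * 512
    if size > len(image):
        return None
    # The PCI data structure offset must also point inside the image.
    (pcioffset,) = struct.unpack_from("<H", image, 0x18)
    if pcioffset + PCIR_SIG_LEN > size:
        return None
    return size
```

Only after all three checks pass would it be safe to use the size as a loop bound or index into the image.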
Peter Maydell [Mon, 14 Jan 2019 11:41:43 +0000 (11:41 +0000)]
Merge remote-tracking branch 'remotes/palmer/tags/riscv-for-master-3.2-part2' into staging
RISC-V Updates for 3.2, Part 2
This patch set contains a handful of Michael's CSR-related cleanups,
which should allow us to proceed with more outstanding bug fixes that
depend on them.
Additionally, there is a patch that turns on USB. This works for me
when the kernel has the appropriate drivers (which will soon be in
defconfig) and I pass
# gpg: Signature made Fri 11 Jan 2019 18:05:02 GMT
# gpg: using RSA key EF4CA1502CCBAB41
# gpg: Good signature from "Palmer Dabbelt <[email protected]>"
# gpg: aka "Palmer Dabbelt <[email protected]>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 00CE 76D1 8349 60DF CE88 6DF8 EF4C A150 2CCB AB41
* remotes/palmer/tags/riscv-for-master-3.2-part2:
default-configs: Enable USB support for RISC-V machines
RISC-V: Implement existential predicates for CSRs
RISC-V: Implement atomic mip/sip CSR updates
RISC-V: Implement modular CSR helper interface
Peter Maydell [Mon, 14 Jan 2019 10:11:36 +0000 (10:11 +0000)]
Merge remote-tracking branch 'remotes/ehabkost/tags/machine-next-pull-request' into staging
Work around test-qht-par + gprof issues
Travis CI jobs are failing because of test-qht-par when gprof is
enabled. Temporarily disable test-qht-par if gprof is enabled,
until we fix the bug.
# gpg: Signature made Fri 11 Jan 2019 18:23:29 GMT
# gpg: using RSA key 2807936F984DC5A6
# gpg: Good signature from "Eduardo Habkost <[email protected]>"
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6
* remotes/ehabkost/tags/machine-next-pull-request:
tests: Disable qht-bench parallel test when using gprof
configure: Let the TARGET_GPROF var use the regular 'y' for Yes
* remotes/bonzini/tags/for-upstream: (34 commits)
avoid TABs in files that only contain a few
remove space-tab sequences
scripts: add script to convert multiline comments into 4-line format
hw/watchdog/wdt_i6300esb: remove a unnecessary comment
checkpatch: warn about qemu/queue.h head structs that are not typedef-ed
qemu/queue.h: simplify reverse access to QTAILQ
qemu/queue.h: reimplement QTAILQ without pointer-to-pointers
qemu/queue.h: remove Q_TAILQ_{HEAD,ENTRY}
qemu/queue.h: typedef QTAILQ heads
qemu/queue.h: leave head structs anonymous unless necessary
vfio: make vfio_address_spaces static
qemu/queue.h: do not access tqe_prev directly
test: replace gtester with a TAP driver
test: execute g_test_run when tests are skipped
qga: drop < Vista compatibility
build-sys: build with Vista API by default
build-sys: move windows defines in osdep.h header
build-sys: don't include windows.h, osdep.h does it
scsi: esp: Defer command completion until previous interrupts have been handled
esp-pci: Fix status register write erase control
...
Paolo Bonzini [Thu, 13 Dec 2018 22:37:37 +0000 (23:37 +0100)]
avoid TABs in files that only contain a few
Most files that have TABs only contain a handful of them. Change
them to spaces so that we don't confuse people.
disas, standard-headers, linux-headers and libdecnumber are imported
from other projects and probably should be exempted from the check.
Outside those, after this patch the following files still contain both
8-space and TAB sequences at the beginning of the line. Many of them
have a majority of TABs, or were initially committed with all tabs.
Paolo Bonzini [Fri, 14 Dec 2018 09:33:22 +0000 (10:33 +0100)]
scripts: add script to convert multiline comments into 4-line format
Since we're adding checkpatch rules to enforce 4-line multiline comment
format, i.e. with lone /* and */, this script can be run on existing
code so that the comment style does not become inconsistent within a
file.
The alternative to awk-in-a-shell-script could be Perl, which also
supports -i directly, but a2p seems to have bitrotten and I didn't quite
feel like writing this twice...
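To illustrate the target format, here is a simplified Python sketch of such a conversion. It is not the awk script added by this commit. It only handles comments that start at the beginning of a line, and it does not treat "/**" openers specially.

```python
def fix_multiline_comments(source: str) -> str:
    """Rewrite multiline comments so that the opening and closing
    delimiters each sit on a line of their own."""
    out = []
    in_comment = False
    indent = ""
    for line in source.split("\n"):
        stripped = line.lstrip()
        if not in_comment:
            if stripped.startswith("/*") and "*/" not in line:
                in_comment = True
                indent = line[: len(line) - len(stripped)]
                text = stripped[2:].strip()
                if text:
                    # Text on the opening line moves to its own " * " line.
                    out.append(indent + "/*")
                    out.append(indent + " * " + text)
                    continue
            out.append(line)
        elif "*/" in line:
            in_comment = False
            body = line[: line.index("*/")].rstrip()
            if body.strip():
                # Text before the closer stays; the closer gets its own line.
                out.append(body)
                out.append(indent + " */")
            else:
                out.append(line)
        else:
            out.append(line)
    return "\n".join(out)
```

A two-line comment such as "/* first line" followed by " * second line */" becomes four lines, while single-line comments and comments already in the 4-line format pass through unchanged.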
Peng Hao [Sat, 8 Dec 2018 07:18:31 +0000 (15:18 +0800)]
hw/watchdog/wdt_i6300esb: remove a unnecessary comment
The registered memory region of i6300esb is not suitable for coalesced
mmio, because a write for the region may trigger an immediate action
and can't be delayed.
Paolo Bonzini [Thu, 6 Dec 2018 11:01:53 +0000 (12:01 +0100)]
qemu/queue.h: reimplement QTAILQ without pointer-to-pointers
QTAILQ is a doubly linked list, with a pointer-to-pointer to the last
element from the head, and the previous element from each node.
But if you squint enough, QTAILQ becomes a combination of a singly-linked
forwards list, and another singly-linked list which goes backwards and
is circular. This is the idea that lets QTAILQ implement reverse
iteration: only, because the backwards list points inside the node,
accessing the previous element needs to go two steps back and one
forwards.
What this patch does is implement it in these terms, without actually
changing the in-memory layout at all. The coexistence of the two lists
is realized by making QTAILQ_HEAD and QTAILQ_ENTRY unions of the forwards
pointer and a generic QTailQLink node. The QTailQLink can walk the list in
both directions; the union is needed so that the forwards pointer can
have the correct type, as a sort of poor man's template. While there
are other ways to get the same layout without a union, this one has
the advantage of simpler operation in the debugger, because the fields
tqh_first and tqe_next still exist as before the patch. Those fields are
also used by scripts/qemugdb/mtree.py, so it's a good idea to preserve them.
The advantage of the new representation is that the two-back-one-forward
dance done by backwards accesses can be done all while operating on
QTailQLinks. No casting to the head struct is needed anymore because,
even though the QTailQLink's forward pointer is a void *, we can use
typeof to recover the correct type. This patch only changes the
implementation, not the interface. The next patch will remove the head
struct name from the backwards visit macros.
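The two-back-one-forward access can be modelled in a few lines of Python. This is a sketch of the idea only, not the C macros: Link stands in for QTailQLink, and the class, attribute and method names are made up for illustration.

```python
class Link:
    """Stand-in for QTailQLink: a forward pointer to an element and a
    backward pointer to another link."""

    def __init__(self):
        self.next = None  # next element, or None at the end
        self.prev = self  # previous link; an empty head points to itself


class Node:
    def __init__(self, value):
        self.value = value
        self.link = Link()


class TailQ:
    def __init__(self):
        self.head = Link()

    def insert_tail(self, node):
        node.link.next = None
        node.link.prev = self.head.prev  # link of the old last element, or the head
        self.head.prev.next = node       # old last element (or head) points forward to node
        self.head.prev = node.link

    def last(self):
        # Two steps back along the circular backwards list, one forwards.
        return self.head.prev.prev.next

    @staticmethod
    def prev(node):
        return node.link.prev.prev.next
```

Both last() and prev() operate purely on links, which mirrors the claim that no cast to the head struct is needed.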
Paolo Bonzini [Thu, 6 Dec 2018 10:56:15 +0000 (11:56 +0100)]
qemu/queue.h: typedef QTAILQ heads
This will be needed when we change the QTAILQ head and elem structs
to unions. However, it is also consistent with the usage elsewhere
in QEMU for other list head structs (see for example FsMountList).
Note that most QTAILQs only need their name in order to do backwards
walks. Those do not break with the struct->union change, and anyway
the change will also remove the need to name heads when doing backwards
walks, so those are not touched here.
Paolo Bonzini [Thu, 6 Dec 2018 10:58:10 +0000 (11:58 +0100)]
qemu/queue.h: leave head structs anonymous unless necessary
Most list head structs need not be given a name. In most cases the
name is given just in case one is going to use QTAILQ_LAST, QTAILQ_PREV
or reverse iteration, but this does not apply to lists of other kinds,
and even for QTAILQ in practice this is only rarely needed. In addition,
we will soon reimplement those macros completely so that they do not
need a name for the head struct. So clean up everything, not giving a
name except in the rare case where it is necessary.
Paolo Bonzini [Thu, 29 Nov 2018 17:45:31 +0000 (18:45 +0100)]
test: replace gtester with a TAP driver
gtester is deprecated by upstream glib (see for example the announcement
at https://blog.gtk.org/2018/07/11/news-from-glib-2-58/) and it does
not support tests that call g_test_skip in some glib stable releases.
glib suggests instead using Automake's TAP support, which gtest itself
supports since version 2.38 (QEMU's minimum requirement is 2.40).
We do not support Automake, but we can use Automake's code to beautify
the TAP output. I chose to use the Perl copy rather than the shell/awk
one, with some changes so that it can accept TAP through stdin, in order
to reuse Perl's TAP parsing package. This also avoids duplicating the
parser between tap-driver.pl and tap-merge.pl.
# gpg: Signature made Thu 10 Jan 2019 14:28:23 GMT
# gpg: using RSA key 2807936F984DC5A6
# gpg: Good signature from "Eduardo Habkost <[email protected]>"
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6
* remotes/ehabkost/tags/machine-next-pull-request:
qom: Don't keep error value between object_property_parse() calls
qdev: fix -device scsi-hd,help regression
machine: Use shorter format for GlobalProperty arrays
machine: Eliminate unnecessary stringify() usage
spapr: Eliminate SPAPR_PCI_2_7_MMIO_WIN_SIZE macro
memory-device: rewrite address assignment using ranges
range: add some more functions
Mention that QMP 'cpu-add' will be deprecated
Update that HMP 'cpu-add' is deprecated in 4.0
qemu-deprecated.texi: Rename the HMP section
Paolo Bonzini [Thu, 29 Nov 2018 17:45:30 +0000 (18:45 +0100)]
test: execute g_test_run when tests are skipped
Sometimes a test's main() function recognizes that the environment
does not support the test, and therefore exits. In this case, we
still should run g_test_run() so that a TAP harness will print the
test plan ("1..0") and the test will be marked as skipped.
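To show why the plan line matters, here is a rough Python sketch of how a TAP consumer might classify a test's output. It is not the tap-driver.pl logic; the function name and return strings are hypothetical, and only plan lines and "ok"/"not ok" lines are examined.

```python
def classify_tap(output: str) -> str:
    """Classify TAP output as passed, failed, skipped, or an error."""
    plan = None
    results = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("1.."):
            fields = line[3:].split()
            if fields and fields[0].isdigit():
                plan = int(fields[0])
        elif line.startswith("not ok"):
            results.append(False)
        elif line.startswith("ok"):
            results.append(True)
    if plan is None:
        # A test that exits without running g_test_run() ends up here.
        return "error: no plan"
    if plan == 0:
        return "skipped"
    if len(results) == plan and all(results):
        return "passed"
    return "failed"
```

With no output at all the harness cannot tell a skip from a crash, whereas a "1..0" plan is unambiguous.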
Building QGA for XP seems possible so far: the dependency on
libqemuutil.a implies building qemu-thread-win32.c, which has required
the Vista API since commit 12f8def0 (v2.9). But since qemu-thread isn't
used in QGA, the resulting binary may still work on XP. XP has been
unsupported for the past 4.5 years; it's time to drop support for it.
Both qemu & qga build with Vista API by default already, by defining
_WIN32_WINNT 0x0600. Set it globally in osdep.h instead.
This replaces WINVER by _WIN32_WINNT in osdep.h. WINVER doesn't seem
to be really useful these days.
(see also https://blogs.msdn.microsoft.com/oldnewthing/20070411-00/?p=27283)
Guenter Roeck [Thu, 29 Nov 2018 17:17:42 +0000 (09:17 -0800)]
scsi: esp: Defer command completion until previous interrupts have been handled
The guest OS reads RSTAT, RSEQ, and RINTR, and expects those registers
to reflect a consistent state. However, it is possible that the registers
can change after RSTAT was read, but before RINTR is read, when
esp_command_complete() is called.
The guest OS would then try to handle INTR_BS combined with an old
value of RSTAT. This sometimes resulted in lost events, spurious
interrupts, guest OS confusion, and stalled SCSI operations.
A typical guest error log (observed with various versions of Linux)
looks as follows.
Guenter Roeck [Wed, 28 Nov 2018 21:56:10 +0000 (13:56 -0800)]
esp-pci: Fix status register write erase control
Per AM53C974 datasheet, definition of "SCSI Bus and Control (SBAC)"
register:
Bit 24 'STATUS' Write Erase Control
This bit controls the Write Erase feature on bits 3:1 and bit 6 of the DMA
Status Register ((B)+54h). When this bit is programmed to '1', the state
of bits 3:1 are preserved when read. Bits 3:1 are only cleared when a '1'
is written to the corresponding bit location. For example, to clear bit 1,
the value of '0000_0010b' should be written to the register. When the DMA
Status Preserve bit is '0', bits 3:1 are cleared when read.
The SBAC_STATUS bit is currently defined as bit 12, not bit 24.
Also, its implementation is reversed: The status is auto-cleared if
the bit is set to 1, and must be cleared explicitly when the bit is
set to 0. This results in spurious interrupts reported by the Linux
kernel, and in some cases even results in stalled SCSI operations.
Set SBAC_STATUS to bit 24 and reverse the logic to fix the problem.
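The intended behaviour, as read from the datasheet text quoted above, can be modelled in Python as follows. This is a sketch, not the esp-pci code: the class and method names are hypothetical, and only bits 3:1 are modelled (bit 6 is left out for brevity).

```python
SBAC_STATUS = 1 << 24     # 'STATUS' Write Erase Control bit in SBAC
CLEARABLE = 0b0000_1110   # bits 3:1 of the DMA Status Register


class DmaStatusModel:
    def __init__(self):
        self.sbac = 0
        self.status = 0

    def read_status(self) -> int:
        value = self.status
        if not (self.sbac & SBAC_STATUS):
            # Bit clear: bits 3:1 are cleared when read.
            self.status &= ~CLEARABLE
        return value

    def write_status(self, value: int) -> None:
        if self.sbac & SBAC_STATUS:
            # Bit set: writing '1' to a bit position clears that bit.
            self.status &= ~(value & CLEARABLE)
```

For example, with SBAC_STATUS set, writing 0000_0010b clears bit 1 and leaves the other status bits untouched.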
Stefan Hajnoczi [Thu, 15 Feb 2018 11:15:26 +0000 (11:15 +0000)]
block/iscsi: cancel libiscsi task when ABORT TASK TMF completes
The libiscsi iscsi_task_mgmt_async() API documentation says:
abort_task will also cancel the scsi task. The callback for the scsi
task will be invoked with SCSI_STATUS_CANCELLED
The libiscsi implementation does not fulfil this promise. The task's
callback is not invoked and its struct iscsi_pdu remains in the internal
list (effectively leaked).
This patch invokes the libiscsi iscsi_scsi_cancel_task() API to force
the task's callback to be invoked with SCSI_STATUS_CANCELLED when the
ABORT TASK TMF completes and the task's callback hasn't been invoked
yet.
Stefan Hajnoczi [Sat, 3 Feb 2018 06:16:21 +0000 (07:16 +0100)]
block/iscsi: fix ioctl cancel use-after-free
iscsi_aio_cancel() does not increment the request's reference count,
causing a use-after-free when ABORT TASK finishes after the request has
already completed.
There are some additional issues with iscsi_aio_cancel():
1. Several ABORT TASKs may be sent for the same task if
iscsi_aio_cancel() is invoked multiple times. It's better to avoid
this just in case the command identifier is reused.
2. The iscsilun->mutex protection is missing in iscsi_aio_cancel().
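The reference-counting issue and the duplicate ABORT TASK issue can be illustrated with a small Python model. It is a sketch under assumptions, not the block/iscsi.c code: the class and function names are invented, and the "freed" flag merely stands in for memory being released.

```python
class Request:
    def __init__(self):
        self.refs = 1          # reference held by the in-flight I/O
        self.freed = False
        self.cancelled = False
        self.aborts_sent = 0

    def ref(self):
        self.refs += 1

    def unref(self):
        if self.freed:
            raise RuntimeError("use after free")
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


def aio_cancel(req: Request) -> None:
    if req.cancelled:
        return                 # send ABORT TASK at most once per request
    req.cancelled = True
    req.ref()                  # keep the request alive until ABORT TASK finishes
    req.aborts_sent += 1


def request_completed(req: Request) -> None:
    req.unref()                # drop the I/O reference


def abort_task_done(req: Request) -> None:
    req.unref()                # drop the reference taken in aio_cancel()
```

Because aio_cancel() takes its own reference, the request survives even when the I/O completes before ABORT TASK does.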
Stefan Hajnoczi [Sat, 3 Feb 2018 06:16:20 +0000 (07:16 +0100)]
block/iscsi: take iscsilun->mutex in iscsi_timed_check_events()
Commit d045c466d9e62b4321fadf586d024d54ddfd8bd4 ("iscsi: do not use
aio_context_acquire/release") introduced iscsilun->mutex but appears to
have overlooked iscsi_timed_check_events() when introducing the mutex.
iscsi_service() and iscsi_set_events() must be called with
iscsilun->mutex held.
iscsi_timed_check_events() is invoked from the AioContext and does not
take the mutex.
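The locking rule can be sketched in Python with threading.Lock. This is an illustrative model, not the QEMU code; the class and method names are hypothetical, and Lock.locked() only reports that some caller holds the lock.

```python
import threading


class IscsiLunModel:
    def __init__(self):
        self.mutex = threading.Lock()
        self.serviced = 0

    def service(self):
        # Stand-in for iscsi_service()/iscsi_set_events(): the mutex must be held.
        if not self.mutex.locked():
            raise RuntimeError("caller must hold the mutex")
        self.serviced += 1

    def timed_check_events(self):
        # The timer callback takes the mutex before touching shared state.
        with self.mutex:
            self.service()
```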