Thomas Huth [Fri, 6 Oct 2017 13:53:19 +0000 (15:53 +0200)]
tests/cpu-plug-test: Check CPU hot-plugging on ppc64, too
Hot plugging on ppc64 is possible via "device_add", too. Unlike x86,
we must not specify a 'socket-id' and 'thread-id' here, so this needs
to be done with a separate function that just specifies the 'core-id'
during the "device_add".
Peter Maydell [Fri, 19 Jan 2018 16:35:25 +0000 (16:35 +0000)]
Merge remote-tracking branch 'remotes/ehabkost/tags/machine-next-pull-request' into staging
machine queue, 2018-01-19
# gpg: Signature made Fri 19 Jan 2018 16:30:19 GMT
# gpg: using RSA key 0x2807936F984DC5A6
# gpg: Good signature from "Eduardo Habkost <[email protected]>"
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6
* remotes/ehabkost/tags/machine-next-pull-request:
fw_cfg: fix memory corruption when all fw_cfg slots are used
possible_cpus: add CPUArchId::type field
nvdimm: add 'unarmed' option
nvdimm: add a macro for property "label-size"
hostmem-file: add "align" option
scripts: Remove fixed entries from the device-crash-test
qdev: Check for the availability of a hotplug controller before adding a device
qdev_monitor: Simplify error handling in qdev_device_add()
q35: Allow only supported dynamic sysbus devices
xen: Add only xen-sysdev to dynamic sysbus device list
spapr: Allow only supported dynamic sysbus devices
ppc: e500: Allow only supported dynamic sysbus devices
hw/arm/virt: Allow only supported dynamic sysbus devices
machine: Replace has_dynamic_sysbus with list of allowed devices
numa: fix missing '-numa cpu' in '-help' output
qemu-options: document memory-backend-ram
qemu-options: document missing memory-backend-file options
memfd: remove needless include
memfd: split qemu_memfd_alloc()
Igor Mammedov [Wed, 10 Jan 2018 15:22:50 +0000 (16:22 +0100)]
possible_cpus: add CPUArchId::type field
Remove dependency of possible_cpus on 1st CPU instance,
which decouples configuration data from CPU instances that
are created using that data.
Also later it would be used for enabling early cpu to numa node
configuration at runtime qmp_query_hotpluggable_cpus() should
provide a list of available cpu slots at early stage,
before machine_init() is called and the 1st cpu is created,
so that mgmt might be able to call it and use output to set
numa mapping.
Use MachineClass::possible_cpu_arch_ids() callback to set
cpu type info, along with the rest of possible cpu properties,
to let machine define which cpu type* will be used.
* for SPAPR it will be a spapr core type and for ARM/s390x/x86
a respective descendant of CPUClass.
Move parse_numa_opts() in vl.c after cpu_model is parsed into
cpu_type so that possible_cpu_arch_ids() would know which
cpu_type to use during layout initialization.
Haozhong Zhang [Mon, 11 Dec 2017 07:28:06 +0000 (15:28 +0800)]
nvdimm: add 'unarmed' option
Currently the only vNVDIMM backend can guarantee the guest write
persistence is device DAX on Linux, because no host-side kernel cache
is involved in the guest access to it. The approach to detect whether
the backend is device DAX needs to access sysfs, which may not work
with SELinux.
Instead, we add the 'unarmed' option to device 'nvdimm', so that users
or management utils, which have enough knowledge about the backend,
can control the unarmed flag in guest ACPI NFIT via this option. The
guest Linux NVDIMM driver, for example, will mark the corresponding
vNVDIMM device read-only if the unarmed flag in guest NFIT is set.
The default value of 'unarmed' option is 'off' in order to keep the
backwards compatibility.
Haozhong Zhang [Mon, 11 Dec 2017 07:28:04 +0000 (15:28 +0800)]
hostmem-file: add "align" option
When mmap(2) the backend files, QEMU uses the host page size
(getpagesize(2)) by default as the alignment of mapping address.
However, some backends may require alignments different than the page
size. For example, mmap a device DAX (e.g., /dev/dax0.0) on Linux
kernel 4.13 to an address, which is 4K-aligned but not 2M-aligned,
fails with a kernel message like
Because there is no common approach to get such alignment requirement,
we add the 'align' option to 'memory-backend-file', so that users or
management utils, which have enough knowledge about the backend, can
specify a proper alignment via this option.
Thomas Huth [Thu, 2 Nov 2017 10:10:06 +0000 (11:10 +0100)]
qdev: Check for the availability of a hotplug controller before adding a device
The qdev_unplug() function contains a g_assert(hotplug_ctrl) statement,
so QEMU crashes when the user tries to device_add + device_del a device
that does not have a corresponding hotplug controller. This could be
provoked for a couple of devices in the past (see commit 4c93950659487c7ad
or 84ebd3e8c7d4fe955 for example), and can currently for example also be
triggered like this:
$ s390x-softmmu/qemu-system-s390x -M none -nographic
QEMU 2.10.50 monitor - type 'help' for more information
(qemu) device_add qemu-s390x-cpu,id=x
(qemu) device_del x
**
ERROR:qemu/qdev-monitor.c:872:qdev_unplug: assertion failed: (hotplug_ctrl)
Aborted (core dumped)
So devices clearly need a hotplug controller when they should be usable
with device_add.
The code in qdev_device_add() already checks whether the bus has a proper
hotplug controller, but for devices that do not have a corresponding bus,
there is no appropriate check available yet. In that case we should check
whether the machine itself provides a suitable hotplug controller and
refuse to plug the device if none is available.
Thomas Huth [Thu, 2 Nov 2017 10:10:05 +0000 (11:10 +0100)]
qdev_monitor: Simplify error handling in qdev_device_add()
Instead of doing the clean-ups on errors multiple times, introduce
a jump label at the end of the function that can be used by all
error paths that need this cleanup.
Eduardo Habkost [Sat, 25 Nov 2017 15:16:10 +0000 (13:16 -0200)]
q35: Allow only supported dynamic sysbus devices
The only user-creatable sysbus devices in qemu-system-x86_64 are
amd-iommu, intel-iommu, and xen-backend. xen-backend is handled
by xen_set_dynamic_sysbus(), so we only need to add amd-iommu and
intel-iommu.
Eduardo Habkost [Sat, 25 Nov 2017 15:16:07 +0000 (13:16 -0200)]
ppc: e500: Allow only supported dynamic sysbus devices
platform_bus_create_devtree() already rejects all dynamic sysbus
devices except TYPE_ETSEC_COMMON, so register it as the only
allowed dynamic sysbus device for the ppce500 machine-type.
Eduardo Habkost [Sat, 25 Nov 2017 15:16:06 +0000 (13:16 -0200)]
hw/arm/virt: Allow only supported dynamic sysbus devices
Replace the TYPE_SYS_BUS_DEVICE entry in the allowed sysbus
device list with the two device types that are really supported
by the virt machine: vfio-amd-xgbe and vfio-calxeda-xgmac.
Eduardo Habkost [Sat, 25 Nov 2017 15:16:05 +0000 (13:16 -0200)]
machine: Replace has_dynamic_sysbus with list of allowed devices
The existing has_dynamic_sysbus flag makes the machine accept
every user-creatable sysbus device type on the command-line.
Replace it with a list of allowed device types, so machines can
easily accept some sysbus devices while rejecting others.
To keep exactly the same behavior as before, the existing
has_dynamic_sysbus=true assignments are replaced with a
TYPE_SYS_BUS_DEVICE entry on the allowed list. Other patches
will replace the TYPE_SYS_BUS_DEVICE entries with more specific
lists of devices.
Peter Maydell [Fri, 19 Jan 2018 10:17:20 +0000 (10:17 +0000)]
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pc, pci, virtio: features, fixes, cleanups
A bunch of fixes, cleanus and new features all over the place.
Signed-off-by: Michael S. Tsirkin <[email protected]>
# gpg: Signature made Thu 18 Jan 2018 20:41:03 GMT
# gpg: using RSA key 0x281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <[email protected]>"
# gpg: aka "Michael S. Tsirkin <[email protected]>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (29 commits)
vhost: remove assertion to prevent crash
vhost-user: fix misaligned access to payload
vhost-user: factor out msg head and payload
tests: acpi: add comments to fetch_rsdt_referenced_tables/data->tables usage
tests: acpi: rename test_acpi_tables()/test_dst_table() to reflect its usage
tests: acpi: init table descriptor in test_dst_table()
tests: acpi: move tested tables array allocation outside of test_acpi_dsdt_table()
x86_iommu: check if machine has PCI bus
x86_iommu: Move machine check to x86_iommu_realize()
vhost-user-test: use init_virtio_dev in multiqueue test
vhost-user-test: make features mask an init_virtio_dev() argument
vhost-user-test: setup virtqueues in all tests
vhost-user-test: extract read-guest-mem test from main loop
vhost-user-test: fix features mask
hw/acpi-build: Make next_base easy to follow
ACPI/unit-test: Add a testcase for RAM allocation in numa node
hw/pci-bridge: fix QEMU crash because of pcie-root-port
intel-iommu: Extend address width to 48 bits
intel-iommu: Redefine macros to enable supporting 48 bit address width
vhost-user: fix multiple queue specification
...
Jay Zhou [Fri, 12 Jan 2018 02:47:57 +0000 (10:47 +0800)]
vhost: remove assertion to prevent crash
QEMU will assert on vhost-user backed virtio device hotplug if QEMU is
using more RAM regions than VHOST_MEMORY_MAX_NREGIONS (for example if
it were started with a lot of DIMM devices).
Fix it by returning error instead of asserting and let callers of
vhost_set_mem_table() handle error condition gracefully.
We currently take a pointer to a misaligned field of a packed structure.
clang reports this as a build warning.
A fix is to keep payload in a separate structure, and access is it
from there using a vectored write.
Igor Mammedov [Fri, 29 Dec 2017 15:16:40 +0000 (16:16 +0100)]
tests: acpi: rename test_acpi_tables()/test_dst_table() to reflect its usage
Main purpose of test_dst_table() is loading a table from QEMU
with checking that checksum in header matches actual one,
rename it reflect main action it performs.
Likewise test_acpi_tables() name is to broad, while the function
only loads tables referenced by RSDT, rename it to reflect it.
Igor Mammedov [Fri, 29 Dec 2017 15:16:38 +0000 (16:16 +0100)]
tests: acpi: move tested tables array allocation outside of test_acpi_dsdt_table()
at best it's confusing that array for list of tables to be tested
against reference tables is allocated within test_acpi_dsdt_table()
and at worst it would just overwrite list of tables if they were
added before test_acpi_dsdt_table().
Move array initialization to test_acpi_one() before we start
processing tables.
Mohammed Gamal [Wed, 29 Nov 2017 12:33:13 +0000 (13:33 +0100)]
x86_iommu: check if machine has PCI bus
Starting qemu with
qemu-system-x86_64 -S -M isapc -device {amd|intel}-iommu
leads to a segfault. The code assume PCI bus is present and
tries to access the bus structure without checking.
Since Intel VT-d and AMDVI should only work with PCI, add a
check for PCI bus and return error if not present.
Dou Liyang [Thu, 14 Dec 2017 04:08:54 +0000 (12:08 +0800)]
ACPI/unit-test: Add a testcase for RAM allocation in numa node
As QEMU supports the memory-less node, it is possible that there is
no RAM in the first numa node(also be called as node0). eg:
... \
-m 128,slots=3,maxmem=1G \
-numa node -numa node,mem=128M \
But, this makes it hard for QEMU to build a known-to-work ACPI SRAT
table. Only fixing it is not enough.
Add a testcase for this situation to make sure the ACPI table is
correct for guest.
Marcel Apfelbaum [Wed, 10 Jan 2018 19:09:09 +0000 (21:09 +0200)]
hw/pci-bridge: fix QEMU crash because of pcie-root-port
If we try to use more pcie_root_ports then available slots
and an IO hint is passed to the port, QEMU crashes because
we try to init the "IO hint" capability even if the device
is not created.
Fix it by checking for error before adding the capability,
so QEMU can fail gracefully.
The current implementation of Intel IOMMU code only supports 39 bits
iova address width. This patch provides a new parameter (x-aw-bits)
for intel-iommu to extend its address width to 48 bits but keeping the
default the same (39 bits). The reason for not changing the default
is to avoid potential compatibility problems with live migration of
intel-iommu enabled QEMU guest. The only valid values for 'x-aw-bits'
parameter are 39 and 48.
After enabling larger address width (48), we should be able to map
larger iova addresses in the guest. For example, a QEMU guest that
is configured with large memory ( >=1TB ). To check whether 48 bits
aw is enabled, we can grep in the guest dmesg output with line:
"DMAR: Host address width 48".
intel-iommu: Redefine macros to enable supporting 48 bit address width
The current implementation of Intel IOMMU code only supports 39 bits
host/iova address width so number of macros use hard coded values based
on that. This patch is to redefine them so they can be used with
variable address widths. This patch doesn't add any new functionality
but enables adding support for 48 bit address width.
Gal Hammer [Sun, 14 Jan 2018 10:06:56 +0000 (12:06 +0200)]
virtio: improve virtio devices initialization time
The loading time of a VM is quite significant when its virtio
devices use a large amount of virt-queues (e.g. a virtio-serial
device with max_ports=511). Most of the time is spend in the
creation of all the required event notifiers (ioeventfd and memory
regions).
This patch pack all the changes to the memory regions in a
single memory transaction.
Gal Hammer [Sun, 14 Jan 2018 10:06:55 +0000 (12:06 +0200)]
virtio: postpone the execution of event_notifier_cleanup function
Use the EventNotifier's cleanup callback function to execute the
event_notifier_cleanup function after kvm unregistered the eventfd.
This change supports running the virtio_bus_set_host_notifier
function inside a memory region transaction. Otherwise, a closed
fd is sent to kvm, which results in a failure.
Changpeng Liu [Thu, 4 Jan 2018 01:53:34 +0000 (09:53 +0800)]
contrib/vhost-user-blk: introduce a vhost-user-blk sample application
This commit introduces a vhost-user-blk backend device, it uses UNIX
domain socket to communicate with QEMU. The vhost-user-blk sample
application should be used with QEMU vhost-user-blk-pci device.
To use it, complie with:
make vhost-user-blk
and start like this:
vhost-user-blk -b /dev/sdb -s /path/vhost.socket
Changpeng Liu [Thu, 4 Jan 2018 01:53:33 +0000 (09:53 +0800)]
contrib/libvhost-user: enable virtio config space messages
Enable VHOST_USER_GET_CONFIG/VHOST_USER_SET_CONFIG messages in
libvhost-user library, users can implement their own I/O target
based on the library. This enable the virtio config space delivered
between QEMU host device and the I/O target.
Changpeng Liu [Thu, 4 Jan 2018 01:53:32 +0000 (09:53 +0800)]
vhost-user-blk: introduce a new vhost-user-blk host device
This commit introduces a new vhost-user device for block, it uses a
chardev to connect with the backend, same with Qemu virito-blk device,
Guest OS still uses the virtio-blk frontend driver.
To use it, start QEMU with command line like this:
Users can use different parameters for `num-queues` and `bootindex`.
Different with exist Qemu virtio-blk host device, it makes more easy
for users to implement their own I/O processing logic, such as all
user space I/O stack against hardware block device. It uses the new
vhost messages(VHOST_USER_GET_CONFIG) to get block virtio config
information from backend process.
Changpeng Liu [Thu, 4 Jan 2018 01:53:31 +0000 (09:53 +0800)]
vhost-user: add new vhost user messages to support virtio config space
Add VHOST_USER_GET_CONFIG/VHOST_USER_SET_CONFIG messages which can be
used for live migration of vhost user devices, also vhost user devices
can benefit from the messages to get/set virtio config space from/to the
I/O target. For the purpose to support virtio config space change,
VHOST_USER_SLAVE_CONFIG_CHANGE_MSG message is added as the event notifier
in case virtio config space change in the slave I/O target.
Peter Maydell [Thu, 18 Jan 2018 12:59:22 +0000 (12:59 +0000)]
Merge remote-tracking branch 'remotes/ehabkost/tags/x86-pull-request' into staging
x86 queue, 2018-01-17
Highlight: new CPU models that expose CPU features that guests
can use to mitigate CVE-2017-5715 (Spectre variant #2).
# gpg: Signature made Thu 18 Jan 2018 02:00:03 GMT
# gpg: using RSA key 0x2807936F984DC5A6
# gpg: Good signature from "Eduardo Habkost <[email protected]>"
# Primary key fingerprint: 5A32 2FD5 ABC4 D3DB ACCF D1AA 2807 936F 984D C5A6
* remotes/ehabkost/tags/x86-pull-request:
i386: Add EPYC-IBPB CPU model
i386: Add new -IBRS versions of Intel CPU models
i386: Add FEAT_8000_0008_EBX CPUID feature word
i386: Add spec-ctrl CPUID bit
i386: Add support for SPEC_CTRL MSR
i386: Change X86CPUDefinition::model_id to const char*
target/i386: add clflushopt to "Skylake-Server" cpu model
pc: add 2.12 machine types
Peter Maydell [Thu, 18 Jan 2018 11:46:27 +0000 (11:46 +0000)]
Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-2.12-20180117' into staging
ppc patch queue 2017-01-17
Another pull request for ppc related patches. The most interesting
thing here is the new capabilities framework for the pseries machine
type. This gives us better handling of several existing
incompatibilities between TCG, PR and HV KVM, as well as new ones that
arise with POWER9. Further, it will allow reasonable handling of the
advertisement of features necessary to mitigate the recent CVEs
(Spectre and Meltdown).
In addition there's:
* Improvide handling of different "vsmt" modes
* Significant enhancements to the "pnv" machine type
* Assorted other bugfixes
* remotes/dgibson/tags/ppc-for-2.12-20180117: (22 commits)
target-ppc: Fix booke206 tlbwe TLB instruction
target/ppc: add support for POWER9 HILE
ppc/pnv: change initrd address
ppc/pnv: fix XSCOM core addressing on POWER9
ppc/pnv: introduce pnv*_is_power9() helpers
ppc/pnv: change core mask for POWER9
ppc/pnv: use POWER9 DD2 processor
tests/boot-serial-test: fix powernv support
ppc/pnv: Update skiboot firmware image
spapr: Adjust default VSMT value for better migration compatibility
spapr: Allow some cases where we can't set VSMT mode in the kernel
target/ppc: Clarify compat mode max_threads value
ppc: Change Power9 compat table to support at most 8 threads/core
spapr: Remove unnecessary 'options' field from sPAPRCapabilityInfo
hw/ppc/spapr_caps: Rework spapr_caps to use uint8 internal representation
spapr: Handle Decimal Floating Point (DFP) as an optional capability
spapr: Handle VMX/VSX presence as an spapr capability flag
target/ppc: Clean up probing of VMX, VSX and DFP availability on KVM
spapr: Validate capabilities on migration
spapr: Treat Hardware Transactional Memory (HTM) as an optional capability
...
John Arbuckle [Mon, 8 Jan 2018 18:07:07 +0000 (13:07 -0500)]
cocoa.m: Fix scroll wheel support
When using a mouse's scroll wheel in a guest with
the cocoa front-end, the mouse pointer moves up
and down instead of scrolling the window. This
patch fixes this problem.
Eric Blake [Wed, 10 Jan 2018 23:08:24 +0000 (17:08 -0600)]
nbd/server: Add helper functions for parsing option payload
Rather than making every callsite perform length sanity checks
and error reporting, add the helper functions nbd_opt_read()
and nbd_opt_drop() that use the length stored in the client
struct; also add an assertion that optlen is 0 before any
option (ie. any previous option was fully handled), complementing
the assertion added in an earlier patch that optlen is 0 after
all negotiation completes.
Note that the call in nbd_negotiate_handle_export_name() does
not use the new helper (in part because the server cannot
reply to NBD_OPT_EXPORT_NAME - it either succeeds or the
connection drops).
Eric Blake [Wed, 10 Jan 2018 23:08:22 +0000 (17:08 -0600)]
nbd/server: Better error for NBD_OPT_EXPORT_NAME failure
When a client abruptly disconnects before we've finished reading
the name sent with NBD_OPT_EXPORT_NAME, we are better off logging
the failure as EIO (we can't communicate with the client), rather
than EINVAL (the client sent bogus data).
Instead of passing currently negotiating option and its length to
many of negotiation functions let's just store them on NBDClient
struct to be state-variables of negotiation phase.
This unifies semantics of negotiation functions and allows
tracking changes of remaining option length in future patches.
Asssert that optlen is back to 0 after negotiation (including
old-style connections which don't negotiate), although we need
more patches before we can assert optlen is 0 between options
during negotiation.
Eric Blake [Wed, 10 Jan 2018 23:08:20 +0000 (17:08 -0600)]
nbd/server: Hoist nbd_reject_length() earlier
No semantic change, but will make it easier for an upcoming patch
to refactor code without having to add forward declarations. Fix
a poor comment while at it.
Eduardo Habkost [Tue, 9 Jan 2018 15:45:17 +0000 (13:45 -0200)]
i386: Add new -IBRS versions of Intel CPU models
The new MSR IA32_SPEC_CTRL MSR was introduced by a recent Intel
microcode updated and can be used by OSes to mitigate
CVE-2017-5715. Unfortunately we can't change the existing CPU
models without breaking existing setups, so users need to
explicitly update their VM configuration to use the new *-IBRS
CPU model if they want to expose IBRS to guests.
The new CPU models are simple copies of the existing CPU models,
with just CPUID_7_0_EDX_SPEC_CTRL added and model_id updated.
Eduardo Habkost [Tue, 9 Jan 2018 15:45:13 +0000 (13:45 -0200)]
i386: Change X86CPUDefinition::model_id to const char*
It is valid to have a 48-character model ID on CPUID, however the
definition of X86CPUDefinition::model_id is char[48], which can
make the compiler drop the null terminator from the string.
If a CPU model happens to have 48 bytes on model_id, "-cpu help"
will print garbage and the object_property_set_str() call at
x86_cpu_load_def() will read data outside the model_id array.
We could increase the array size to 49, but this would mean the
compiler would not issue a warning if a 49-char string is used by
mistake for model_id.
To make things simpler, simply change model_id to be const char*,
and validate the string length using an assert() on
x86_register_cpudef_type().
Haozhong Zhang [Tue, 19 Dec 2017 03:37:30 +0000 (11:37 +0800)]
target/i386: add clflushopt to "Skylake-Server" cpu model
CPUID_7_0_EBX_CLFLUSHOPT is missed in current "Skylake-Server" cpu
model. Add it to "Skylake-Server" cpu model on pc-i440fx-2.12 and
pc-q35-2.12. Keep it disabled in "Skylake-Server" cpu model on older
machine types.
Luc MICHEL [Mon, 15 Jan 2018 09:32:20 +0000 (10:32 +0100)]
target-ppc: Fix booke206 tlbwe TLB instruction
When overwritting a valid TLB entry with a new one, the previous page
were not flushed in QEMU TLB, leading to incoherent mapping. This commit
fixes this.
Cédric Le Goater [Mon, 15 Jan 2018 18:04:05 +0000 (19:04 +0100)]
ppc/pnv: change initrd address
When skiboot starts, it first clears the CPU structs for all possible
CPUs on a system :
for (i = 0; i <= cpu_max_pir; i++)
memset(&cpu_stacks[i].cpu, 0, sizeof(struct cpu_thread));
On POWER9, cpu_max_pir is quite big, 0x7fff, and the skiboot cpu_stacks
array overlaps with the memory region in which QEMU maps the initramfs
file. Move it upwards in memory to keep it safe.
David Gibson [Mon, 15 Jan 2018 06:51:33 +0000 (17:51 +1100)]
spapr: Adjust default VSMT value for better migration compatibility
fa98fbfc "PC: KVM: Support machine option to set VSMT mode" introduced the
"vsmt" parameter for the pseries machine type, which controls the spacing
of the vcpu ids of thread 0 for each virtual core. This was done to bring
some consistency and stability to how that was done, while still allowing
backwards compatibility for migration and otherwise.
The default value we used for vsmt was set to the max of the host's
advertised default number of threads and the number of vthreads per vcore
in the guest. This was done to continue running without extra parameters
on older KVM versions which don't allow the VSMT value to be changed.
Unfortunately, even that smaller than before leakage of host configuration
into guest visible configuration still breaks things. Specifically a guest
with 4 (or less) vthread/vcore will get a different vsmt value when
running on a POWER8 (vsmt==8) and POWER9 (vsmt==4) host. That means the
vcpu ids don't line up so you can't migrate between them, though you should
be able to.
Long term we really want to make vsmt == smp_threads for sufficiently
new machine types. However, that means that qemu will then require a
sufficiently recent KVM (one which supports changing VSMT) - that's still
not widely enough deployed to be really comfortable to do.
In the meantime we need some default that will work as often as
possible. This patch changes that default to 8 in all circumstances.
This does change guest visible behaviour (including for existing
machine versions) for many cases - just not the most common/important
case.
Following is case by case justification for why this is still the least
worst option. Note that any of the old behaviours can still be duplicated
after this patch, it's just that it requires manual intervention by
setting the vsmt property on the command line.
KVM HV on POWER8 host:
This is the overwhelmingly common case in production setups, and is
unchanged by design. POWER8 hosts will advertise a default VSMT mode
of 8, and > 8 vthreads/vcore isn't permitted
KVM HV on POWER7 host:
Will break, but POWER7s allowing KVM were never released to the public.
KVM HV on POWER9 host:
Not yet released to the public, breaking this now will reduce other
breakage later.
KVM HV on PowerPC 970:
Will theoretically break it, but it was barely supported to begin with
and already required various user visible hacks to work. Also so old
that I just don't care.
TCG:
This is the nastiest one; it means migration of TCG guests (without
manual vsmt setting) will break. Since TCG is rarely used in production
I think this is worth it for the other benefits. It does also remove
one more barrier to TCG<->KVM migration which could be interesting for
debugging applications.
KVM PR:
As with TCG, this will break migration of existing configurations,
without adding extra manual vsmt options. As with TCG, it is rare in
production so I think the benefits outweigh breakages.
David Gibson [Tue, 16 Jan 2018 04:37:37 +0000 (15:37 +1100)]
spapr: Allow some cases where we can't set VSMT mode in the kernel
At present if we require a vsmt mode that's not equal to the kernel's
default, and the kernel doesn't let us change it (e.g. because it's an old
kernel without support) then we always fail.
But in fact we can cope with the kernel having a different vsmt as long as
a) it's >= the actual number of vthreads/vcore (so that guest threads
that are supposed to be on the same core act like it)
b) it's a submultiple of the requested vsmt mode (so that guest threads
spaced by the vsmt value will act like they're on different cores)
Allowing this case gives us a bit more freedom to adjust the vsmt behaviour
without breaking existing cases.
David Gibson [Mon, 15 Jan 2018 05:40:23 +0000 (16:40 +1100)]
target/ppc: Clarify compat mode max_threads value
We recently had some discussions that were sidetracked for a while, because
nearly everyone misapprehended the purpose of the 'max_threads' field in
the compatiblity modes table. It's all about guest expectations, not host
expectations or support (that's handled elsewhere).
In an attempt to avoid a repeat of that confusion, rename the field to
'max_vthreads' and add an explanatory comment.
hw/ppc/spapr_caps: Rework spapr_caps to use uint8 internal representation
Currently spapr_caps are tied to boolean values (on or off). This patch
reworks the caps so that they can have any uint8 value. This allows more
capabilities with various values to be represented in the same way
internally. Capabilities are numbered in ascending order. The internal
representation of capability values is an array of uint8s in the
sPAPRMachineState, indexed by capability number.
Capabilities can have their own name, description, options, getter and
setter functions, type and allow functions. They also each have their own
section in the migration stream. Capabilities are only migrated if they
were explictly set on the command line, with the assumption that
otherwise the default will match.
On migration we ensure that the capability value on the destination
is greater than or equal to the capability value from the source. So
long at this remains the case then the migration is considered
compatible and allowed to continue.
This patch implements generic getter and setter functions for boolean
capabilities. It also converts the existings cap-htm, cap-vsx and
cap-dfp capabilities to this new format.
David Gibson [Mon, 11 Dec 2017 06:34:30 +0000 (17:34 +1100)]
spapr: Handle Decimal Floating Point (DFP) as an optional capability
Decimal Floating Point has been available on POWER7 and later (server)
cpus. However, it can be disabled on the hypervisor, meaning that it's
not available to guests.
We currently handle this by conditionally advertising DFP support in the
device tree depending on whether the guest CPU model supports it - which
can also depend on what's allowed in the host for -cpu host. That can lead
to confusion on migration, since host properties are silently affecting
guest visible properties.
This patch handles it by treating it as an optional capability for the
pseries machine type.
David Gibson [Thu, 7 Dec 2017 06:08:47 +0000 (17:08 +1100)]
spapr: Handle VMX/VSX presence as an spapr capability flag
We currently have some conditionals in the spapr device tree code to decide
whether or not to advertise the availability of the VMX (aka Altivec) and
VSX vector extensions to the guest, based on whether the guest cpu has
those features.
This can lead to confusion and subtle failures on migration, since it makes
a guest visible change based only on host capabilities. We now have a
better mechanism for this, in spapr capabilities flags, which explicitly
depend on user options rather than host capabilities.
Rework the advertisement of VSX and VMX based on a new VSX capability. We
no longer bother with a conditional for VMX support, because every CPU
that's ever been supported by the pseries machine type supports VMX.
NOTE: Some userspace distributions (e.g. RHEL7.4) already rely on
availability of VSX in libc, so using cap-vsx=off may lead to a fatal
SIGILL in init.
David Gibson [Mon, 11 Dec 2017 06:41:34 +0000 (17:41 +1100)]
target/ppc: Clean up probing of VMX, VSX and DFP availability on KVM
When constructing the "host" cpu class we modify whether the VMX and VSX
vector extensions and DFP (Decimal Floating Point) are available
based on whether KVM can support those instructions. This can depend on
policy in the host kernel as well as on the actual host cpu capabilities.
However, the way we probe for this is not very nice: we explicitly check
the host's device tree. That works in practice, but it's not really
correct, since the device tree is a property of the host kernel's platform
which we don't really know about. We get away with it because the only
modern POWER platforms happen to encode VMX, VSX and DFP availability in
the device tree in the same way.
Arguably we should have an explicit KVM capability for this, but we haven't
needed one so far. Barring specific KVM policies which don't yet exist,
each of these instruction classes will be available in the guest if and
only if they're available in the qemu userspace process. We can determine
that from the ELF AUX vector we're supplied with.
Once reworked like this, there are no more callers for kvmppc_get_vmx() and
kvmppc_get_dfp() so remove them.
David Gibson [Mon, 11 Dec 2017 04:09:37 +0000 (15:09 +1100)]
spapr: Validate capabilities on migration
Now that the "pseries" machine type implements optional capabilities (well,
one so far) there's the possibility of having different capabilities
available at either end of a migration. Although arguably a user error,
it would be nice to catch this situation and fail as gracefully as we can.
This adds code to migrate the capabilities flags. These aren't pulled
directly into the destination's configuration since what the user has
specified on the destination command line should take precedence. However,
they are checked against the destination capabilities.
If the source was using a capability which is absent on the destination,
we fail the migration, since that could easily cause a guest crash or other
bad behaviour. If the source lacked a capability which is present on the
destination we warn, but allow the migration to proceed.
David Gibson [Mon, 11 Dec 2017 02:10:44 +0000 (13:10 +1100)]
spapr: Treat Hardware Transactional Memory (HTM) as an optional capability
This adds an spapr capability bit for Hardware Transactional Memory. It is
enabled by default for pseries-2.11 and earlier machine types. with POWER8
or later CPUs (as it must be, since earlier qemu versions would implicitly
allow it). However it is disabled by default for the latest pseries-2.12
machine type.
This means that with the latest machine type, HTM will not be available,
regardless of CPU, unless it is explicitly enabled on the command line.
That change is made on the basis that:
* This way running with -M pseries,accel=tcg will start with whatever cpu
and will provide the same guest visible model as with accel=kvm.
- More specifically, this means existing make check tests don't have
to be modified to use cap-htm=off in order to run with TCG
* We hope to add a new "HTM without suspend" feature in the not too
distant future which could work on both POWER8 and POWER9 cpus, and
could be enabled by default.
* Best guesses suggest that future POWER cpus may well only support the
HTM-without-suspend model, not the (frankly, horribly overcomplicated)
POWER8 style HTM with suspend.
* Anecdotal evidence suggests problems with HTM being enabled when it
wasn't wanted are more common than being missing when it was.
David Gibson [Thu, 7 Dec 2017 23:35:35 +0000 (10:35 +1100)]
spapr: Capabilities infrastructure
Because PAPR is a paravirtual environment access to certain CPU (or other)
facilities can be blocked by the hypervisor. PAPR provides ways to
advertise in the device tree whether or not those features are available to
the guest.
In some places we automatically determine whether to make a feature
available based on whether our host can support it, in most cases this is
based on limitations in the available KVM implementation.
Although we correctly advertise this to the guest, it means that host
factors might make changes to the guest visible environment which is bad:
as well as generaly reducing reproducibility, it means that a migration
between different host environments can easily go bad.
We've mostly gotten away with it because the environments considered mature
enough to be well supported (basically, KVM on POWER8) have had consistent
feature availability. But, it's still not right and some limitations on
POWER9 is going to make it more of an issue in future.
This introduces an infrastructure for defining "sPAPR capabilities". These
are set by default based on the machine version, masked by the capabilities
of the chosen cpu, but can be overriden with machine properties.
The intention is at reset time we verify that the requested capabilities
can be supported on the host (considering TCG, KVM and/or host cpu
limitations). If not we simply fail, rather than silently modifying the
advertised featureset to the guest.
This does mean that certain configurations that "worked" may now fail, but
such configurations were already more subtly broken.
target/ppc: Yet another fix for KVM-HV HPTE accessors
As stated in the 1ad9f0a464fe commit log, the returned entries are not
a whole PTEG. It was not a problem before 1ad9f0a464fe as it would read
a single record assuming it contains a whole PTEG but now the code tries
reading the entire PTEG and "if ((n - i) < invalid)" produces negative
values which then are converted to size_t for memset() and that throws
seg fault.
Peter Maydell [Tue, 16 Jan 2018 17:36:39 +0000 (17:36 +0000)]
Merge remote-tracking branch 'remotes/rth/tags/pull-tcg-20180116' into staging
Queued TCG patches
# gpg: Signature made Tue 16 Jan 2018 16:24:50 GMT
# gpg: using RSA key 0x64DF38E8AF7E215F
# gpg: Good signature from "Richard Henderson <[email protected]>"
# Primary key fingerprint: 7A48 1E78 868B 4DB6 A85A 05C0 64DF 38E8 AF7E 215F
* remotes/rth/tags/pull-tcg-20180116:
tcg/ppc: Allow a 32-bit offset to the constant pool
tcg/ppc: Support tlb offsets larger than 64k
tcg/arm: Support tlb offsets larger than 64k
tcg/arm: Fix double-word comparisons
tcg/ppc: Allow a 32-bit offset to the constant pool
We recently relaxed the limit of the number of opcodes that can
appear in a TranslationBlock. In certain cases this has resulted
in relocation overflow.