Objects should not be "delayed" without a reason, as the previous
commit demonstrates. The remaining ones have reasons. State them,
and demand that future ones come with such a statement.
qemu-system-FOO's main() acts on command line arguments in its own
idiosyncratic order. There's not much method to its madness.
Whenever we find a case where one kind of command line argument needs
to refer to something created for another kind later, we rejigger the
order.
Recent commit cda4aa9a5a "vl: Create block backends before setting
machine properties" was such a rejigger. Block backends are now
created before "delayed" objects. This broke persistent reservation
management. Reproducer:
$ qemu-system-x86_64 -object pr-manager-helper,id=pr-helper0,path=/tmp/pr-helper0.sock-drive -drive file=/dev/mapper/crypt,file.pr-manager=pr-helper0,format=raw,if=none,id=drive-scsi0-0-0-2
qemu-system-x86_64: -drive file=/dev/mapper/crypt,file.pr-manager=pr-helper0,format=raw,if=none,id=drive-scsi0-0-0-2: No persistent reservation manager with id 'pr-helper0'
The delayed pr-manager-helper object is created too late for use by
-drive or -blockdev. Normal objects are still created in time.
pr-manager-helper has always been a delayed object (commit 7c9e527659
"scsi, file-posix: add support for persistent reservation
management"). Turns out there's no real reason for that. Make it a
normal object.
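The ordering problem described above can be sketched with a toy model.
This is only an illustration: the Registry class, start() and the two
phase lists are hypothetical and far simpler than what main() really does.

```python
class Registry:
    """Toy model of objects that may refer to each other by id."""

    def __init__(self):
        self.objects = {}

    def create(self, obj_id, needs=None):
        # A reference can only be resolved if its target already exists.
        if needs is not None and needs not in self.objects:
            raise LookupError(
                f"No persistent reservation manager with id '{needs}'")
        self.objects[obj_id] = needs


def start(phases):
    """Create everything phase by phase, in the order given."""
    reg = Registry()
    for phase in phases:
        for obj_id, needs in phase:
            reg.create(obj_id, needs)
    return reg


helper = ("pr-helper0", None)
drive = ("drive-scsi0-0-0-2", "pr-helper0")

# Hypothetical phase orders; the real startup sequence has many more steps.
broken_order = [[drive], [helper]]  # block backends, then delayed objects
fixed_order = [[helper], [drive]]   # normal objects, then block backends
```

With broken_order the drive is created before the helper it refers to,
so the lookup fails; with fixed_order the helper already exists.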
Alex Williamson [Tue, 14 May 2019 20:14:41 +0000 (14:14 -0600)]
q35: Revert to kernel irqchip
Commit b2fc91db8447 ("q35: set split kernel irqchip as default") changed
the default for the pc-q35-4.0 machine type to use split irqchip, which
turned out to have disastrous effects on vfio-pci INTx support. KVM
resampling irqfds are registered for handling these interrupts, but
these are non-functional in split irqchip mode. We can't simply test
for split irqchip in QEMU as userspace handling of this interrupt is a
significant performance regression versus KVM handling (GeForce GPUs
assigned to Windows VMs are non-functional without forcing MSI mode or
re-enabling kernel irqchip).
The resolution is to revert the change in default irqchip mode in the
pc-q35-4.1 machine and create a pc-q35-4.0.1 machine for the 4.0-stable
branch. The pc-q35-4.0 machine type should not be used in vfio-pci
configurations for devices requiring legacy INTx support without
explicitly modifying the VM configuration to use kernel irqchip.
Paolo Bonzini [Fri, 15 Mar 2019 09:16:20 +0000 (10:16 +0100)]
ci: store Patchew configuration in the tree
Patchew cannot yet retrieve the configuration from the QEMU Git tree, but
this is planned. In the meantime, let's start storing it as YAML
so that the Patchew configuration (currently accessible only to administrators)
is public and documented.
Paolo Bonzini [Mon, 18 Mar 2019 14:06:50 +0000 (15:06 +0100)]
libqos: i2c: move address into QI2CDevice
This removes the hardcoded I2C address from the tests. The address
is passed via QOSGraphEdgeOptions to i2c_device_create and stored
in the QI2CDevice.
The i2c_send and i2c_recv functions, along with their wrappers,
therefore, can be changed to take a QI2CDevice rather than an
adapter/address pair.
Paolo Bonzini [Mon, 18 Mar 2019 12:48:23 +0000 (13:48 +0100)]
libqos: convert I2C to qgraph
Create an i2c-bus interface, corresponding to the I2CAdapter struct.
Wrap IMXI2C and OMAPI2C with a QOSGraphObject, and add the get_driver
function to retrieve the I2CAdapter.
The conversion is still not complete; for simplicity, i2c_recv and
i2c_send (along with their wrappers) still take an adapter/address
pair. Fixing that would be complicated until the tests are converted
to qgraph, so it is left for after the conversion.
Paolo Bonzini [Mon, 18 Mar 2019 16:12:25 +0000 (17:12 +0100)]
libqos: split I2CAdapter initialization and allocation
Provide *_init functions that populate an I2CAdapter struct without
allocating one, and make the existing *_create functions wrap them.
Because in the new setup *_create might return a pointer inside the
IMXI2C or OMAPI2C struct, create companion *_free functions to go
back to the outer pointer.
All this is temporary until allocation is handled entirely by
qgraph.
Paolo Bonzini [Mon, 18 Mar 2019 15:49:59 +0000 (16:49 +0100)]
libqos: fix omap-i2c receiving more than 4 bytes
If more than 4 bytes are received, the FIFO cannot host the entire
contents of the transfer and STP will be nonzero before entering
the transfer loop. Also, CNT will contain the number of bytes
left to be transferred instead of the total number of bytes in
the transfer.
(Reverse engineered from the omap_i2c.c source code; no available
datasheet).
Paolo Bonzini [Mon, 18 Mar 2019 14:09:51 +0000 (15:09 +0100)]
libqos: move common i2c code to libqos
The functions to read/write 8-bit or 16-bit registers are the same
in tmp105 and pca9552 tests, and in fact they are a special case of
"read block"/"write block" functionality; read block in turn is used
in ds1338-test.
Move everything inside libqos-test, removing the duplication. Account
for the small differences by adding to tmp105-test.c the "read register
after writing" behavior that is specific to it.
Paolo Bonzini [Mon, 18 Mar 2019 16:14:12 +0000 (17:14 +0100)]
qgraph: fix qos_node_contains with options
Currently, if qos_node_contains was passed options, it would still
create an edge without any options. Instead, in that case
NULL acts as a terminator.
Paolo Bonzini [Mon, 18 Mar 2019 13:48:47 +0000 (14:48 +0100)]
qgraph: allow extra_device_opts on contains nodes
Allow choosing the bus that the device will be placed on, in case
the machine has more than one. Otherwise, the bus may not match
the base address of the controller we attach it to.
Li Qiang [Fri, 10 May 2019 16:43:49 +0000 (09:43 -0700)]
edu: uses uint64_t in dma operation
The DMA-related variables dma.dst/src/cnt are of type dma_addr_t, which
is uint64_t on x64 platforms. Change their uses from uint32_t to
uint64_t to avoid truncation in edu_dma_timer.
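A small sketch of the truncation being avoided. The helper names and
the sample address are made up for illustration; only the 32-bit versus
64-bit widths come from the message above.

```python
UINT32_MASK = 0xFFFFFFFF
UINT64_MASK = 0xFFFFFFFFFFFFFFFF


def as_uint32(value):
    """Model storing a value in a 32-bit variable: the high bits are lost."""
    return value & UINT32_MASK


def as_uint64(value):
    """Model storing a value in a 64-bit variable."""
    return value & UINT64_MASK


# Hypothetical DMA source address just above 4 GiB.
dma_src = 0x1_0000_2000
```

Here as_uint32(dma_src) yields 0x2000, a different address, while
as_uint64(dma_src) keeps the value intact.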
Li Qiang [Fri, 10 May 2019 16:43:47 +0000 (09:43 -0700)]
edu: mmio: allow 64-bit access
The edu spec says the MMIO area can be accessed with 64-bit accesses.
However, the current 'max_access_size' does not allow that, so the MMIO
access dispatch can only access 32 bits at a time. This patch fixes
this to respect the spec.
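A rough model of the effect, assuming the dispatcher splits an access
larger than max_access_size into max-sized pieces. split_access() and
the register offset are hypothetical, not QEMU code.

```python
def split_access(addr, size, max_access_size):
    """Return the (address, size) pieces a device would see for one access."""
    step = min(size, max_access_size)
    return [(addr + off, step) for off in range(0, size, step)]
```

With a limit of 4, split_access(0x80, 8, 4) gives two 4-byte pieces;
with a limit of 8 the device sees a single 8-byte access.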
Wanpeng Li [Tue, 14 May 2019 06:06:39 +0000 (14:06 +0800)]
i386: Enable IA32_MISC_ENABLE MWAIT bit when exposing mwait/monitor
The CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR
IA32_MISC_ENABLE MWAIT bit and as userspace has control of them
both, it is userspace's job to configure both bits to match on
the initial setup.
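A minimal sketch of keeping the two bits in sync. The bit position used
for the MSR (bit 18) is not stated in the message above; it is taken
here as an assumption from the Intel SDM description of IA32_MISC_ENABLE.
sync_mwait() is a hypothetical helper.

```python
CPUID_ECX_MONITOR = 1 << 3    # CPUID.01H:ECX[bit 3]
MISC_ENABLE_MWAIT = 1 << 18   # assumed position of the MWAIT enable bit


def sync_mwait(cpuid_1_ecx, misc_enable):
    """Return misc_enable with its MWAIT bit mirroring the CPUID bit."""
    if cpuid_1_ecx & CPUID_ECX_MONITOR:
        return misc_enable | MISC_ENABLE_MWAIT
    return misc_enable & ~MISC_ENABLE_MWAIT
```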
vl: make -accel help to list enabled accelerators only
Currently, -accel help shows all possible accelerators regardless
of whether they are enabled in the binary. That is a different
semantic from the -cpu and -machine help output, for example. So this
change makes it list only the accelerators whose support is compiled
into the target binary.
Note that it does not check if the accelerator is enabled in the
host, so the help message's header was rewritten to emphasize
that. Also qtest is not displayed given that it is used for
internal testing purposes only.
Paolo Bonzini [Thu, 14 Mar 2019 18:20:07 +0000 (19:20 +0100)]
test-thread-pool: be more reliable
There is a rare race between the atomic_cmpxchg and
bdrv_aio_cancel/bdrv_aio_cancel_async invocations. Detect it; the
only sensible thing we can do about it is to exit long_cb immediately.
Peter Maydell [Mon, 3 Jun 2019 09:25:12 +0000 (10:25 +0100)]
Merge remote-tracking branch 'remotes/amarkovic/tags/mips-queue-jun-1-2019' into staging
MIPS queue for June 1st, 2019
# gpg: Signature made Sat 01 Jun 2019 19:20:47 BST
# gpg: using RSA key D4972A8967F75A65
# gpg: Good signature from "Aleksandar Markovic <[email protected]>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 8526 FBF1 5DA3 811F 4A01 DD75 D497 2A89 67F7 5A65
* remotes/amarkovic/tags/mips-queue-jun-1-2019:
target/mips: Improve performance of certain MSA instructions
target/mips: Clean up lmi_helper.c
target/mips: Clean up dsp_helper.c
tests/tcg: target/mips: Add tests for MSA bit set instructions
target/mips: Amend and cleanup MSA TCG tests
target/mips: Add emulation of MMI instruction PCPYUD
target/mips: Add emulation of MMI instruction PCPYLD
target/mips: Add emulation of MMI instruction PCPYH
tests/tcg: target/mips: Add tests for MSA bit set instructions
Add tests for MSA bit set instructions. This includes the following
instructions:
* BCLR.B - clear bit (bytes)
* BCLR.H - clear bit (halfwords)
* BCLR.W - clear bit (words)
* BCLR.D - clear bit (doublewords)
* BNEG.B - negate bit (bytes)
* BNEG.H - negate bit (halfwords)
* BNEG.W - negate bit (words)
* BNEG.D - negate bit (doublewords)
* BSET.B - set bit (bytes)
* BSET.H - set bit (halfwords)
* BSET.W - set bit (words)
* BSET.D - set bit (doublewords)
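As a reading aid, the per-element behaviour of the three instruction
groups listed above can be sketched as follows. This assumes the bit
position is the second operand's element taken modulo the element width;
the function names are illustrative and each call models one element.

```python
def _bit(wt_elem, width):
    # Bit position comes from the second operand, modulo the element width.
    return 1 << (wt_elem % width)


def bclr(ws_elem, wt_elem, width):
    """Clear the selected bit of one element."""
    return ws_elem & ~_bit(wt_elem, width) & ((1 << width) - 1)


def bneg(ws_elem, wt_elem, width):
    """Negate (flip) the selected bit of one element."""
    return (ws_elem ^ _bit(wt_elem, width)) & ((1 << width) - 1)


def bset(ws_elem, wt_elem, width):
    """Set the selected bit of one element."""
    return (ws_elem | _bit(wt_elem, width)) & ((1 << width) - 1)
```

The width argument is 8, 16, 32 or 64 for the .B, .H, .W and .D
variants respectively.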
Add missing bits and pieces of the tests of the emulation of certain
MSA instructions (non-immediate variants): some tests were missing the
last two cases; some instructions were missing wrappers; some tests
included wrong headers; some tests were missing altogether. Also update
some copyright preambles and do several other minor cleanups.
Peter Maydell [Thu, 30 May 2019 14:08:00 +0000 (15:08 +0100)]
Merge remote-tracking branch 'remotes/dgibson/tags/ppc-for-4.1-20190529' into staging
ppc patch queue 2019-05-29
Next pull request against qemu-4.1. Highlights:
* KVM accelerated support for the XIVE interrupt controller in PAPR
guests
* A number of TCG vector fixes
* Fixes for the PReP / 40p machine
* Improvements to make check-tcg test coverage
Other than that it's just a bunch of assorted fixes, cleanups and
minor improvements.
This supersedes both the pull request dated 2019-05-21 and the one
dated 2019-05-22. I've dropped one hunk which I think may have caused
the check-tcg failure that Peter saw (by enabling the ppc64abi32
build, which I think has been broken for ages). I'm not entirely
certain, since I haven't reproduced exactly the same failure.
* remotes/dgibson/tags/ppc-for-4.1-20190529: (44 commits)
ppc/pnv: add dummy XSCOM registers for PRD initialization
ppc/pnv: introduce new skiboot platform properties
spapr: Don't migrate the hpt_maxpagesize cap to older machine types
spapr: change default interrupt mode to 'dual'
spapr/xive: fix multiple resets when using the 'dual' interrupt mode
docs: provide documentation on the POWER9 XIVE interrupt controller
spapr/irq: add KVM support to the 'dual' machine
ppc/xics: fix irq priority in ics_set_irq_type()
spapr/irq: initialize the IRQ device only once
spapr/irq: introduce a spapr_irq_init_device() helper
spapr: check for the activation of the KVM IRQ device
spapr: introduce routines to delete the KVM IRQ device
sysbus: add a sysbus_mmio_unmap() helper
spapr/xive: activate KVM support
spapr/xive: add migration support for KVM
spapr/xive: introduce a VM state change handler
spapr/xive: add state synchronization with KVM
spapr/xive: add hcall support when under KVM
spapr/xive: add KVM support
spapr: Print out extra hints when CAS negotiation of interrupt mode fails
...
* remotes/kraxel/tags/usb-20190529-pull-request:
usb-tablet: fix serial compat property
usb-hub: emulate per port power switching
usb-hub: add usb_hub_port_update()
usb-hub: add helpers to update port state
usb-hub: make number of ports runtime-configurable
usb-hub: tweak feature names
usb-host: avoid libusb_set_configuration calls
usb-host: skip reset for untouched devices
usb: call reset handler before updating state
* remotes/jnsnow/tags/bitmaps-pull-request:
iotests: test external snapshot with bitmap copying
qapi: support external bitmaps in block-dirty-bitmap-merge
migration/dirty-bitmaps: change bitmap enumeration method
Peter Maydell [Thu, 30 May 2019 10:17:56 +0000 (11:17 +0100)]
Merge remote-tracking branch 'remotes/maxreitz/tags/pull-block-2019-05-28' into staging
Block patches:
- qcow2: Use threads for encrypted I/O
- qemu-img rebase: Optimizations
- backup job: Allow any source node, and some refactoring
- Some general simplifications in the block layer
* remotes/maxreitz/tags/pull-block-2019-05-28: (21 commits)
blockdev: loosen restrictions on drive-backup source node
qcow2-bitmap: initialize bitmap directory alignment
qcow2: skip writing zero buffers to empty COW areas
qemu-img: rebase: Reuse in-chain BlockDriverState
qemu-img: rebase: Reduce reads on in-chain rebase
qemu-img: rebase: Reuse parent BlockDriverState
block: Make bdrv_root_attach_child() unref child_bs on failure
block: Use bdrv_unref_child() for all children in bdrv_close()
block/backup: refactor: split out backup_calculate_cluster_size
block/backup: unify different modes code path
block/backup: refactor and tolerate unallocated cluster skipping
block/backup: move to copy_bitmap with granularity
block/backup: simplify backup_incremental_init_copy_bitmap
qcow2: do encryption in threads
qcow2: bdrv_co_pwritev: move encryption code out of the lock
qcow2: qcow2_co_preadv: improve locking
qcow2-threads: split out generic path
qcow2-threads: qcow2_co_do_compress: protect queuing by mutex
qcow2-threads: use thread_pool_submit_co
qcow2: add separate file for threaded data processing functions
...
Gerd Hoffmann [Fri, 24 May 2019 07:03:09 +0000 (09:03 +0200)]
usb-hub: add usb_hub_port_update()
Helper function to update port status bits which depend on the
connected device. We need the same logic for device attach and
port reset, so factor it out.
Gerd Hoffmann [Fri, 24 May 2019 07:03:08 +0000 (09:03 +0200)]
usb-hub: add helpers to update port state
Add usb_hub_port_set() and usb_hub_port_clear() helpers which care about
updating the change bits (port->wPortChange) properly, so we don't need
to have that logic sprinkled all over the place ;)
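The idea behind such helpers can be sketched like this. It is not the
QEMU implementation: HubPort, port_set(), port_clear() and the change
mask are assumptions made for illustration.

```python
# Assumed: the low five status bits (connection, enable, suspend,
# over-current, reset) each have a matching change bit.
PORT_CHANGE_MASK = 0x001F
PORT_STAT_CONNECTION = 0x0001
PORT_STAT_ENABLE = 0x0002


class HubPort:
    def __init__(self):
        self.wPortStatus = 0
        self.wPortChange = 0


def port_set(port, status):
    """Set status bits; flag a change only for bits that actually flipped."""
    flipped = status & ~port.wPortStatus
    port.wPortStatus |= status
    port.wPortChange |= flipped & PORT_CHANGE_MASK
    return flipped != 0


def port_clear(port, status):
    """Clear status bits; flag a change only for bits that actually flipped."""
    flipped = status & port.wPortStatus
    port.wPortStatus &= ~status
    port.wPortChange |= flipped & PORT_CHANGE_MASK
    return flipped != 0
```

Callers then never touch wPortChange directly, which is the point of
centralizing the logic.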
Gerd Hoffmann [Wed, 22 May 2019 09:47:02 +0000 (11:47 +0200)]
usb-host: avoid libusb_set_configuration calls
Seems some devices become confused when we call
libusb_set_configuration(). So before calling the function check
whether the device has multiple configurations in the first place, and
in case it hasn't (which is the case for the majority of devices) simply
skip the call as it will have no effect anyway.
Gerd Hoffmann [Wed, 22 May 2019 09:47:01 +0000 (11:47 +0200)]
usb-host: skip reset for untouched devices
If the guest didn't talk to the device yet, skip the reset.
Without this usb-host devices get reset a number of times
at boot time for no good reason.
Add new virtio-gpu devices with a "vhost-user" property. The
associated vhost-user backend is used to handle the virtio rings and
provide rendering results thanks to the vhost-user-gpu protocol.
Example usage:
-object vhost-user-backend,id=vug,cmd="./vhost-user-gpu"
-device vhost-user-vga,vhost-user=vug
Add a base class that is common to virtio-gpu and vhost-user-gpu
devices.
The VirtIOGPUBase base class provides common functionalities necessary
for both virtio-gpu and vhost-user-gpu:
- common configuration (max-outputs, initial resolution, flags)
- virtio device initialization, including queue setup
- device pre-conditions checks (iommu)
- migration blocker
- virtio device callbacks
- hooking up to qemu display subsystem
- a few common helper functions to reset the device, retrieve display
information
- a class callback to unblock the rendering (for GL updates)
What is left to the virtio-gpu subdevice to take care of, in short,
are all the virtio queues handling, command processing and migration.
Add a vhost-user gpu backend, based on virtio-gpu/3d device. It is
associated with a vhost-user-gpu device.
Various TODO and nice to have items:
- multi-head support
- crash & resume handling
- accelerated rendering/display that avoids the waiting round trips
- edid support
Add a new vhost-user message to give a unix socket to a vhost-user
backend for GPU display updates.
Back when I started that work, I added a new GPU channel because the
vhost-user protocol wasn't bidirectional. Since then, there is a
vhost-user-slave channel for the slave to send requests to the master.
We could extend it with GPU messages. However, the GPU protocol is
quite orthogonal to vhost-user, thus I chose to have a new dedicated
channel.
Cédric Le Goater [Mon, 27 May 2019 07:17:22 +0000 (09:17 +0200)]
ppc/pnv: add dummy XSCOM registers for PRD initialization
PRD (Processor recovery diagnostics) is a service available on
OpenPower systems. The opal-prd daemon initializes the PowerPC
Processor through the XSCOM bus and then waits for hardware diagnostic
events.
Cédric Le Goater [Mon, 27 May 2019 07:17:49 +0000 (09:17 +0200)]
ppc/pnv: introduce new skiboot platform properties
Newer skiboots (after 6.3) support QEMU platforms that have
characteristics closer to real OpenPOWER systems. The CPU type is used
to define the BMC drivers: Aspeed AST2400 for POWER8 processors and
AST2500 for POWER9s.
Advertise the new platform property names, "qemu,powernv8" and
"qemu,powernv9", using the CPU type chosen for the QEMU PowerNV
machine. Also, advertise the original platform name "qemu,powernv" in
case of POWER8 processors for compatibility with older skiboots.
Greg Kurz [Wed, 22 May 2019 13:43:46 +0000 (15:43 +0200)]
spapr: Don't migrate the hpt_maxpagesize cap to older machine types
Commit 0b8c89be7f7b added the hpt_maxpagesize capability to the migration
stream. This is okay for new machine types but it breaks backward migration
to older QEMUs, which don't expect the extra subsection.
Add a compatibility boolean flag to the sPAPR machine class and use it to
skip migration of the capability for machine types 4.0 and older. This
fixes migration to an older QEMU. Note that the destination will emit a
warning:
qemu-system-ppc64: warning: cap-hpt-max-page-size lower level (16) in incoming stream than on destination (24)
This is expected and harmless though. It is okay to migrate from a lower
HPT maximum page size (64k) to a greater one (16M).
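The levels in the warning read as page size shifts, which is how 16 and
24 map to 64k and 16M. The sketch below also models the compatibility
flag as a plain boolean; both function names and the flag parameter are
hypothetical.

```python
def page_size_bytes(shift):
    """Page size for a given capability level, read as a power-of-two shift."""
    return 1 << shift


def should_migrate_cap(machine_skips_cap):
    """Send the capability only if the machine type does not opt out."""
    return not machine_skips_cap
```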
Cédric Le Goater [Wed, 22 May 2019 07:40:16 +0000 (09:40 +0200)]
spapr: change default interrupt mode to 'dual'
Now that XIVE support is complete (QEMU emulated and KVM devices),
change the pseries machine to advertise both interrupt modes: XICS
(P7/P8) and XIVE (P9).
The machine default interrupt mode depends on the version. Current
settings are:
pseries   default interrupt mode
4.1       dual
4.0       xics
3.1       xics
3.0       legacy xics (different IRQ number space layout)
Cédric Le Goater [Wed, 22 May 2019 07:40:15 +0000 (09:40 +0200)]
spapr/xive: fix multiple resets when using the 'dual' interrupt mode
Today, when a reset occurs on a pseries machine using the 'dual'
interrupt mode, the KVM devices are released and recreated depending
on the interrupt mode selected by CAS. If XIVE is selected, the SysBus
memory regions of the SpaprXive model are initialized by the KVM
backend initialization routine each time a reset occurs. This leads to
a crash after a couple of resets because the machine reaches the
QDEV_MAX_MMIO limit of SysBusDevice :
To fix, initialize the SysBus memory regions in spapr_xive_realize()
called only once and remove the same inits from the QEMU and KVM
backend initialization routines which are called at each reset.
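A toy model of why per-reset initialization eventually fails. The limit
value, the number of regions and all names below are assumptions for
illustration only.

```python
QDEV_MAX_MMIO = 32  # assumed limit, for illustration only
REGIONS = 3         # hypothetical number of regions the model registers


class ToySysBusDevice:
    def __init__(self):
        self.num_mmio = 0

    def init_mmio(self):
        # Each registration consumes one slot; slots are never given back.
        if self.num_mmio >= QDEV_MAX_MMIO:
            raise RuntimeError("QDEV_MAX_MMIO limit reached")
        self.num_mmio += 1


def run(resets, init_at_reset):
    """Return the number of slots used after the given number of resets."""
    dev = ToySysBusDevice()
    if not init_at_reset:
        for _ in range(REGIONS):      # done once, at realize time
            dev.init_mmio()
    for _ in range(resets):
        if init_at_reset:
            for _ in range(REGIONS):  # repeated at every reset
                dev.init_mmio()
    return dev.num_mmio
```

Initializing at realize time keeps the slot count constant no matter
how many resets follow.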
Cédric Le Goater [Mon, 13 May 2019 08:42:45 +0000 (10:42 +0200)]
spapr/irq: add KVM support to the 'dual' machine
The interrupt mode is chosen by the CAS negotiation process and
activated after a reset to take into account the required changes in
the machine. This brings new constraints on how the associated KVM IRQ
device is initialized.
Currently, each model takes care of the initialization of the KVM
device in its realize method, but this is not possible anymore as the
initialization needs to be done globally when the interrupt mode is
known, i.e. when the machine is reset. It also means that we need a way
to delete a KVM device when another mode is chosen.
Also, to support migration, the QEMU objects holding the state to
transfer should always be available but not necessarily activated.
The overall approach of this proposal is to initialize both interrupt
modes at the QEMU level to keep the IRQ number space in sync and to
allow switching from one mode to another. For the KVM side of things,
the whole initialization of the KVM device, sources and presenters, is
grouped in a single routine. The XICS and XIVE sPAPR IRQ reset
handlers are modified accordingly to handle the init and the delete
sequences of the KVM device.
Cédric Le Goater [Mon, 13 May 2019 08:42:44 +0000 (10:42 +0200)]
ppc/xics: fix irq priority in ics_set_irq_type()
Recent commits changed the behavior of ics_set_irq_type() to
correctly initialize LSIs at the KVM level. ics_set_irq_type() is also
called by the realize routine of the different devices of the machine
when initial interrupts are claimed, before the ICSState device is
reset.
In that case, the ICSIRQState priority is 0x0 and the call to
ics_set_irq_type() results in configuring the target of the
interrupt. On P9, when using the KVM XICS-on-XIVE device, the target
is configured to be server 0, priority 0 and the event queue 0 is
created automatically by KVM.
With the dual interrupt mode creating the KVM device at reset, it
leads to unexpected effects on the guest, mostly blocking IPIs. This
is wrong; fix it by resetting the ICSIRQState structure when
ics_set_irq_type() is called.
Cédric Le Goater [Mon, 13 May 2019 08:42:43 +0000 (10:42 +0200)]
spapr/irq: initialize the IRQ device only once
Add a check to make sure that the routine initializing the emulated
IRQ device is called only once. We don't have much to test on the XICS
side, so we introduce an 'init' boolean under ICSState.
Cédric Le Goater [Mon, 13 May 2019 08:42:42 +0000 (10:42 +0200)]
spapr/irq: introduce a spapr_irq_init_device() helper
The way the XICS and the XIVE devices are initialized follows the same
pattern. First, try to connect to the KVM device and, if that is not
possible, fall back on the emulated device, unless a kernel_irqchip is
required. The spapr_irq_init_device() routine implements this sequence
in a generic way using new sPAPR IRQ handlers ->init_emu() and ->init_kvm().
The XIVE init sequence is moved under the associated sPAPR IRQ
->init() handler. This will change again when KVM support is added for
the dual interrupt mode.
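The fallback sequence described above can be sketched as follows. This
is a simplified model, not the actual routine: the function name, its
parameters and the use of OSError to signal a KVM failure are all
assumptions.

```python
def init_irq_device(init_kvm, init_emu, kvm_available, irqchip_required):
    """Try the KVM device first, then fall back on the emulated one."""
    if kvm_available:
        try:
            init_kvm()
            return "kvm"
        except OSError:
            # No fallback is allowed when a kernel irqchip is required.
            if irqchip_required:
                raise
    elif irqchip_required:
        raise RuntimeError("kernel irqchip required but KVM is unavailable")
    init_emu()
    return "emulated"
```

Both interrupt controller models can share this one routine and only
provide their own pair of init callbacks.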
Cédric Le Goater [Mon, 13 May 2019 08:42:41 +0000 (10:42 +0200)]
spapr: check for the activation of the KVM IRQ device
The activation of the KVM IRQ device depends on the interrupt mode
chosen at CAS time by the machine and some methods used at reset or by
the migration need to be protected.
Cédric Le Goater [Mon, 13 May 2019 08:42:40 +0000 (10:42 +0200)]
spapr: introduce routines to delete the KVM IRQ device
If a new interrupt mode is chosen by CAS, the machine generates a
reset to reconfigure. At this point, the connection with the previous
KVM device needs to be closed and a new connection needs to be opened
with the KVM device operating the chosen interrupt mode.
New routines are introduced to destroy the XICS and the XIVE KVM
devices. They make use of a new KVM device ioctl which destroys the
device and also disconnects the IRQ presenters from the vCPUs.
Cédric Le Goater [Mon, 13 May 2019 08:42:37 +0000 (10:42 +0200)]
spapr/xive: add migration support for KVM
When the VM is stopped, the VM state handler stabilizes the XIVE IC
and marks the EQ pages dirty. These are then transferred to destination
before the transfer of the device vmstates starts.
The SpaprXive interrupt controller model captures the XIVE internal
tables, EAT and ENDT and the XiveTCTX model does the same for the
thread interrupt context registers.
At restart, the SpaprXive 'post_load' method restores all the XIVE
states. It is called by the sPAPR machine 'post_load' method, when all
XIVE states have been transferred and loaded.
Finally, the source states are restored in the VM change state handler
when the machine reaches the running state.
Cédric Le Goater [Mon, 13 May 2019 08:42:36 +0000 (10:42 +0200)]
spapr/xive: introduce a VM state change handler
This handler is in charge of stabilizing the flow of event notifications
in the XIVE controller before migrating a guest. This is a requirement
before transferring the guest EQ pages to a destination.
When the VM is stopped, the handler sets the source PQs to PENDING to
stop the flow of events and to possibly catch a triggered interrupt
occurring while the VM is stopped. Their previous state is saved. The
XIVE controller is then synced through KVM to flush any in-flight
event notification and to stabilize the EQs. At this stage, the EQ
pages are marked dirty to make sure the EQ pages are transferred if a
migration sequence is in progress.
The previous configuration of the sources is restored when the VM
resumes, after a migration or a stop. If an interrupt was queued while
the VM was stopped, the handler simply generates the missing trigger.
Cédric Le Goater [Mon, 13 May 2019 08:42:35 +0000 (10:42 +0200)]
spapr/xive: add state synchronization with KVM
This extends the KVM XIVE device backend with 'synchronize_state'
methods used to retrieve the state from KVM. The HW state of the
sources, the KVM device and the thread interrupt contexts are
collected for the monitor usage and also migration.
These get operations rely on their KVM counterpart in the host kernel
which acts as a proxy for OPAL, the host firmware. The set operations
will be added for migration support later.
Cédric Le Goater [Mon, 13 May 2019 08:42:34 +0000 (10:42 +0200)]
spapr/xive: add hcall support when under KVM
XIVE hcalls are all redirected to QEMU as none are on a fast path.
When necessary, QEMU invokes KVM through specific ioctls to perform
host operations. QEMU should have done the necessary checks before
calling KVM and, in case of failure, H_HARDWARE is simply returned.
H_INT_ESB is a special case that could have been handled under KVM
but the impact on performance was low when under QEMU. Here are some
figures :
Cédric Le Goater [Mon, 13 May 2019 08:42:33 +0000 (10:42 +0200)]
spapr/xive: add KVM support
This introduces a set of helpers when KVM is in use, which create the
KVM XIVE device, initialize the interrupt sources at a KVM level and
connect the interrupt presenters to the vCPU.
They also handle the initialization of the TIMA and the source ESB
memory regions of the controller. These have a different type under
KVM. They are 'ram device' memory mappings, similarly to VFIO, exposed
to the guest and the associated VMAs on the host are populated
dynamically with the appropriate pages using a fault handler.
David Gibson [Mon, 20 May 2019 05:38:40 +0000 (15:38 +1000)]
spapr: Fix phb_placement backwards compatibility
When we added support for NVLink2 passthrough devices, we changed the
phb_placement hook to handle the placement of NVLink2 bridges' specific
resources. For compatibility we use a version that doesn't do this
allocation for old machine types.
However, because of the delay between when the patch was posted and when
it was merged, we ended up with that compatibility hook applying for
machine versions 3.1 and earlier whereas it should apply for 4.0 and
earlier (since the patch was applied early in the 4.1 tree).
David Gibson [Fri, 17 May 2019 04:10:44 +0000 (14:10 +1000)]
spapr: Add forgotten capability to migration stream
spapr machine capabilities are supposed to be sent in the migration stream
so that we can sanity check the source and destination have compatible
configuration. Unfortunately, when we added the hpt-max-page-size
capability, we forgot to add it to the migration state. This means that we
can generate spurious warnings when both ends are configured for large
pages, or potentially fail to warn if the source is configured for huge
pages, but the destination is not.
Fixes: 2309832afda "spapr: Maximum (HPT) pagesize property"
Signed-off-by: David Gibson <[email protected]>
Reviewed-by: Cédric Le Goater <[email protected]>
target/ppc: Set PSSCR_EC on cpu halt to prevent spurious wakeup
The processor stop status and control register (PSSCR) is used to
control the power saving facilities of the thread. The exit criterion
bit (EC) is used to specify whether the thread should be woken by any
interrupt (EC == 0) or only an interrupt enabled in the LPCR to wake the
thread (EC == 1).
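The exit criterion can be expressed as a small predicate. This is a
sketch of the rule as stated above, with hypothetical names.

```python
def should_wake(ec, interrupt_pending, enabled_in_lpcr):
    """Decide whether a stopped thread wakes up for a pending interrupt."""
    if not interrupt_pending:
        return False
    if ec == 0:
        return True          # any interrupt wakes the thread
    return enabled_in_lpcr   # EC == 1: only LPCR-enabled interrupts do
```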
The rtas facilities start-cpu and self-stop are used to transition a
vcpu between the stopped and running states. When a vcpu is stopped it
may only be started again by the start-cpu rtas call.
Currently a vcpu in the stopped state will start again whenever an
interrupt comes along due to PSSCR_EC being cleared, and while this is
architecturally correct for a hardware thread, a vcpu is expected to
only be woken by calling start-cpu. This means that when performing a
reboot on a TCG machine, the secondary threads will restart while the
primary is still in SLOF. This is unsupported and causes call traces
like:
SLOF **********************************************************************
QEMU Starting
Build Date = Jan 14 2019 18:00:39
FW Version = git-a5b428e1c1eae703
Press "s" to enter Open Firmware.
qemu: fatal: Trying to deliver HV exception (MSR) 70 with no HV support
Greg Kurz [Wed, 15 May 2019 17:04:24 +0000 (19:04 +0200)]
spapr/xive: Sanity checks of OV5 during CAS
If a machine is started with ic-mode=xive but the guest only knows
about XICS, e.g. an RHEL 7.6 guest, the kernel panics. This is
expected but a bit unfortunate since the crash doesn't provide
much information for the end user to guess what's happening.
Detect that during CAS and exit QEMU with a proper error message
instead, like it is already done for the MMU.
Even if this is less likely to happen, the opposite case of a guest
that only knows about XIVE would certainly fail all the same if the
machine is started with ic-mode=xics.
Also, the only valid values a guest can pass in byte 23 of OV5 during
CAS are 0b00 (XIVE legacy mode) and 0b01 (XIVE exploitation mode). Any
other value is a bug, at least with the current spec. Again, it does
not seem right to let the guest go on without a precise idea of the
interrupt mode it asked for.
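The checks described above amount to something like the following
sketch. The function name, the string mode names and the exception
types are assumptions; the real code exits QEMU with an error message.

```python
XIVE_LEGACY = 0b00        # guest asks for XICS
XIVE_EXPLOITATION = 0b01  # guest asks for XIVE


def check_interrupt_mode(ic_mode, guest_value):
    """Return the negotiated mode, or raise if the request cannot be met."""
    if guest_value not in (XIVE_LEGACY, XIVE_EXPLOITATION):
        raise ValueError(f"invalid interrupt mode value {guest_value:#04b}")
    guest_mode = "xics" if guest_value == XIVE_LEGACY else "xive"
    if ic_mode == "dual":
        return guest_mode  # the machine supports whichever the guest wants
    if ic_mode != guest_mode:
        raise RuntimeError(
            f"guest requested {guest_mode} but machine uses ic-mode={ic_mode}")
    return guest_mode
```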
Anton Blanchard [Thu, 9 May 2019 00:35:45 +0000 (10:35 +1000)]
target/ppc: Optimise VSX_LOAD_SCALAR_DS and VSX_VECTOR_LOAD_STORE
A few small optimisations:
In VSX_LOAD_SCALAR_DS() we don't need to read the VSR via
get_cpu_vsrh().
Split VSX_VECTOR_LOAD_STORE() into two functions. Loads only need to
write the VSRs (set_cpu_vsr*()) and stores only need to read the VSRs
(get_cpu_vsr*())
The high order bits of the address of the OS event queue are stored in
bits [4-31] of word2 of the XIVE END internal structures and the low
order bits in word3. This structure is using Big Endian ordering and
computing the value requires some simple arithmetic which happens to
be wrong. The mask removing bits [0-3] of word2 is applied to the
wrong value and the resulting address is bogus when above 64GB.
Guests with more than 64GB of RAM will allocate pages for the OS event
queues which will reside above the 64GB limit. In this case, the XIVE
device model will wake up the CPUs in case of a notification, such as
IPIs, but the update of the event queue will be written at the wrong
place in memory. The result is uncertain as the guest memory is
trashed and IPIs are not delivered.
Introduce a helper xive_end_qaddr() to compute this value correctly in
all places where it is used.
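One way such a bug can arise is masking the raw big-endian word before
byte swapping it on a little-endian host. The sketch below assumes that
scenario; the function names are illustrative and this is not the
helper's actual code.

```python
QADDR_HI_MASK = 0x0FFFFFFF  # keeps bits [4-31] of word2 (big-endian numbering)


def bswap32(value):
    """Swap the byte order of a 32-bit value."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def end_qaddr(w2_raw, w3_raw):
    """Swap first, then mask: the intended computation."""
    return ((bswap32(w2_raw) & QADDR_HI_MASK) << 32) | bswap32(w3_raw)


def end_qaddr_masked_too_early(w2_raw, w3_raw):
    """Mask the raw word, then swap: clears address bits 36-39 instead."""
    return (bswap32(w2_raw & QADDR_HI_MASK) << 32) | bswap32(w3_raw)


def encode(addr):
    """Raw word2/word3 values a little-endian host would load for addr."""
    return bswap32(addr >> 32), bswap32(addr & 0xFFFFFFFF)
```

In this model both functions agree for addresses below 64GB (2^36) and
diverge as soon as bit 36 of the address is set, which matches the
symptom described above.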
When the OS configures the EQ page in which to receive event
notifications from the XIVE interrupt controller, the page should be
naturally aligned. Add this check.
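A natural alignment check is a one-liner. The sketch assumes the queue
size is given as a power-of-two shift; the function name is made up.

```python
def is_naturally_aligned(addr, size_shift):
    """True if addr is a multiple of the queue size (1 << size_shift)."""
    return (addr & ((1 << size_shift) - 1)) == 0
```

For example, a 64k queue (shift 16) at 0x10000 passes, while the same
queue at 0x11000 does not.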
target/ppc: Add ibm,purr and ibm,spurr device-tree properties
The ibm,purr and ibm,spurr device tree properties are used to indicate
that the processor implements the Processor Utilisation of Resources
Register (PURR) and Scaled Processor Utilisation of Resources Registers
(SPURR), respectively. Each property has a single value which represents
the level of architecture supported. A value of 1 for ibm,purr means
support for the version of the PURR defined in book 3 in version 2.02 of
the architecture. A value of 1 for ibm,spurr means support for the
version of the SPURR defined in version 2.05 of the architecture.
Add these properties for all processors for which the PURR and SPURR
registers are generated.
hw/ppc/40p: Move the MC146818 RTC to the board where it belongs
The MC146818 RTC was incorrectly added to the i82378 chipset in
commit a04ff940974a. In the next commit (506b7ddf8893) the PReP
machine uses the i82378.
Since the MC146818 is specific to the PReP machine, move its use
there.