Kevin Wolf [Fri, 22 Nov 2019 15:57:48 +0000 (16:57 +0100)]
qcow2: Declare BDRV_REQ_NO_FALLBACK supported
In the common case, qcow2_co_pwrite_zeroes() already only modifies
metadata case, so we're fine with or without BDRV_REQ_NO_FALLBACK set.
The only exception is when using an external data file, where the
request is passed down to the block driver of the external data file. We
are forwarding the BDRV_REQ_NO_FALLBACK flag there, though, so this is
fine, too.
Kevin Wolf [Tue, 26 Nov 2019 15:45:49 +0000 (16:45 +0100)]
block: Error out on image creation with conflicting size options
If both the create options (qemu-img create -o ...) and the size
parameter were given, the size parameter was silently ignored. Instead,
make specifying two sizes an error.
Stefan Hajnoczi [Thu, 5 Dec 2019 13:46:46 +0000 (13:46 +0000)]
qemu-img: fix info --backing-chain --image-opts
Only apply --image-opts to the topmost image when listing an entire
backing chain. It is incorrect to treat backing filenames as image
options. Assuming we have the backing chain t.IMGFMT.base <-
t.IMGFMT.mid <- t.IMGFMT, qemu-img info fails as follows:
$ qemu-img info --backing-chain --image-opts \
driver=qcow2,file.driver=file,file.filename=t.IMGFMT
qemu-img: Could not open 'TEST_DIR/t.IMGFMT.mid': Cannot find device=TEST_DIR/t.IMGFMT.mid nor node_name=TEST_DIR/t.IMGFMT.mid
Thomas Huth [Mon, 2 Dec 2019 10:16:31 +0000 (11:16 +0100)]
iotests: Skip test 079 if it is not possible to create large files
Test 079 fails in the arm64, s390x and ppc64le LXD containers on Travis
(which we will hopefully enable in our CI soon). These containers
apparently do not allow large files to be created. Test 079 tries to
create a 4G sparse file, which is apparently already too big for these
containers, so check first whether we can really create such files before
executing the test.
Thomas Huth [Mon, 2 Dec 2019 10:16:30 +0000 (11:16 +0100)]
iotests: Skip test 060 if it is not possible to create large files
Test 060 fails in the arm64, s390x and ppc64le LXD containers on Travis
(which we will hopefully enable in our CI soon). These containers
apparently do not allow large files to be created. The repair process
in test 060 creates a file of 64 GiB, so test first whether such large
files are possible and skip the test if that's not the case.
Thomas Huth [Mon, 2 Dec 2019 10:16:29 +0000 (11:16 +0100)]
iotests: Provide a function for checking the creation of huge files
Some tests create huge (but sparse) files, and to be able to run those
tests in certain limited environments (like CI containers), we have to
check for the possibility to create such files first. Thus let's introduce
a common function to check for large files, and replace the already
existing checks in the iotests 005 and 220 with this function.
* remotes/huth-gitlab/tags/pull-request-2019-12-17:
tests: use g_test_rand_int
tests/Makefile: Fix check-report.* targets shown in check-help
glib: use portable g_setenv()
hw/misc/ivshmem: Bury dead legacy INTx code
pseries: disable migration-test if /dev/kvm cannot be used
tests: fix modules-test 'duplicate test case' error
Remove libbluetooth / bluez from the CI tests
Remove the core bluetooth code
hw/usb: Remove the USB bluetooth dongle device
hw/arm/nseries: Replace the bluetooth chardev with a "null" chardev
Peter Maydell [Tue, 17 Dec 2019 14:34:31 +0000 (14:34 +0000)]
Merge remote-tracking branch 'remotes/cleber/tags/python-next-pull-request' into staging
Python queue 2019-12-17
# gpg: Signature made Tue 17 Dec 2019 05:12:43 GMT
# gpg: using RSA key 7ABB96EB8B46B94D5E0FE9BB657E8D33A5F209F3
# gpg: Good signature from "Cleber Rosa <[email protected]>" [marginal]
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 7ABB 96EB 8B46 B94D 5E0F E9BB 657E 8D33 A5F2 09F3
* remotes/cleber/tags/python-next-pull-request:
python/qemu: Remove unneeded imports in __init__
python/qemu: accel: Add tcg_available() method
python/qemu: accel: Strengthen kvm_available() checks
python/qemu: accel: Add list_accel() method
python/qemu: Move kvm_available() to its own module
Acceptance tests: use relative location for tests
Acceptance tests: use avocado tags for machine type
Acceptance tests: introduce utility method for tags unique vals
Acceptance test x86_cpu_model_versions: use default vm
tests/acceptance: Makes linux_initrd and empty_cpu_model use QEMUMachine
python/qemu: Add set_qmp_monitor() to QEMUMachine
analyze-migration.py: replace numpy with python 3.2
analyze-migration.py: fix find() type error
Revert "Acceptance test: cancel test if m68k kernel packages goes missing"
tests/boot_linux_console: Fetch assets from Debian snapshot archives
* remotes/dgibson/tags/ppc-for-5.0-20191217: (88 commits)
pseries: Update SLOF firmware image
ppc/pnv: Drop PnvChipClass::type
ppc/pnv: Introduce PnvChipClass::xscom_pcba() method
ppc/pnv: Drop pnv_chip_is_power9() and pnv_chip_is_power10() helpers
ppc/pnv: Pass content of the "compatible" property to pnv_dt_xscom()
ppc/pnv: Pass XSCOM base address and address size to pnv_dt_xscom()
ppc/pnv: Introduce PnvChipClass::xscom_core_base() method
ppc/pnv: Introduce PnvChipClass::intc_print_info() method
ppc/pnv: Drop pnv_is_power9() and pnv_is_power10() helpers
ppc/pnv: Introduce PnvMachineClass::dt_power_mgt()
ppc/pnv: Introduce PnvMachineClass and PnvMachineClass::compat
ppc/pnv: Drop PnvPsiClass::chip_type
ppc/pnv: Introduce PnvPsiClass::compat
ppc: Drop useless extern annotation for functions
ppc/pnv: Fix OCC common area region mapping
ppc/pnv: Introduce PBA registers
ppc/pnv: Make PnvXScomInterface an incomplete type
ppc/pnv: populate the DT with realized XSCOM devices
ppc/pnv: Loop on the whole hierarchy to populate the DT with the XSCOM nodes
target/ppc: Add SPR TBU40
...
* remotes/ehabkost/tags/x86-next-pull-request:
i386: Use g_autofree in a few places
i386: Add new CPU model Cooperlake
i386: Add macro for stibp
i386: Add MSR feature bit for MDS-NO
Paolo Bonzini [Thu, 12 Dec 2019 01:17:58 +0000 (02:17 +0100)]
tests: use g_test_rand_int
g_test_rand_int provides a reproducible random integer number, using a
different number seed every time but allowing reproduction using the
--seed command line option. It is thus better suited to tests than
g_random_int or random.
tests/Makefile: Fix check-report.* targets shown in check-help
The check-report.html and check-report.xml targets were replaced
with check-report.tap in commit 9df43317b82 but the check-help
text was not updated so it still lists check-report.html.
Devices "ivshmem-plain" and "ivshmem-doorbell" support only MSI-X.
Config space register Interrupt Pin is zero. Device "ivshmem"
additionally supported legacy INTx, but it was removed in commit 5a0e75f0a9 "hw/misc/ivshmem: Remove deprecated "ivshmem" legacy
device". The commit left ivshmem_update_irq() behind. Since the
Interrupt Pin register is zero, the function does nothing. Remove it.
Laurent Vivier [Wed, 20 Nov 2019 17:09:55 +0000 (18:09 +0100)]
pseries: disable migration-test if /dev/kvm cannot be used
On ppc64, migration-test only works with kvm_hv, and we already
have a check to verify the module is loaded.
kvm_hv module can be loaded in memory and /sys/module/kvm_hv exists,
but on some systems (like build systems) /dev/kvm can be missing
(by administrators choice).
And as kvm_hv exists test-migration is started but QEMU falls back to
TCG because it cannot be used:
Could not access KVM kernel module: No such file or directory
failed to initialize KVM: No such file or directory
Back to tcg accelerator
And as the test is done with TCG, it fails.
As for s390x, we must check for the existence and the access rights
of /dev/kvm.
Thomas Huth [Wed, 20 Nov 2019 09:10:13 +0000 (10:10 +0100)]
Remove the core bluetooth code
It's been deprecated since QEMU v3.1. We've explicitly asked in the
deprecation message that people should speak up on qemu-devel in case
they are still actively using the bluetooth part of QEMU, but nobody
ever replied that they are really still using it.
I've tried it on my own to use this bluetooth subsystem for one of my
guests, but I was also not able to get it running anymore: When I was
trying to pass-through a real bluetooth device, either the guest did
not see the device at all, or the guest crashed.
Even worse for the emulated device: When running
qemu-system-x86_64 -bt device:keyboard
QEMU crashes once you hit a key.
So it seems like the bluetooth stack is not only neglected, it is
completely bitrotten, as far as I can tell. The only attention that
this code got during the past years were some CVEs that have been
spotted there. So this code is a burden for the developers, without
any real benefit anymore. Time to remove it.
Note: hw/bt/Kconfig only gets cleared but not removed here yet.
Otherwise there is a problem with the *-softmmu/config-devices.mak.d
dependency files - they still contain a reference to this file which
gets evaluated first on some build hosts, before the file gets
properly recreated. To avoid breaking these builders, we still need
the file around for some time. It will get removed in a couple of
weeks instead.
Alexey Kardashevskiy (12):
allocator: Fix format strings for DEBUG
virtio: Make virtio_set_qaddr static
client: Load initramdisk location
sloffs: Fix -Wunused-result gcc warnings in read/write
pci-phb: Reimplement dma-map-in/out
virtio: Store queue descriptors in virtio_device
virtio-net: Init queues after features negotiation
virtio: Enable IOMMU
ibm,client-architecture-support: Fix stack handling
fdt: Fix updating the tree at H_CAS
version: update to 20191206
version: update to 20191217
Michael Roth (1):
dma: Define default dma methods for using by client/package instances
The XSCOM bus is implemented with a QOM interface, which is mostly
generic from a CPU type standpoint, except for the computation of
addresses on the Pervasive Connect Bus (PCB) network. This is handled
by the pnv_xscom_pcba() function with a switch statement based on
the chip_type class level attribute of the CPU chip.
This can be achieved using QOM. Also the address argument is masked with
PNV_XSCOM_SIZE - 1, which is for POWER8 only. Addresses may have different
sizes with other CPU types. Have each CPU chip type handle the appropriate
computation with a QOM xscom_pcba() method.
Greg Kurz [Fri, 13 Dec 2019 12:00:24 +0000 (13:00 +0100)]
ppc/pnv: Pass content of the "compatible" property to pnv_dt_xscom()
Since pnv_dt_xscom() is called from chip specific dt_populate() hooks,
it shouldn't have to guess the chip type in order to populate the
"compatible" property. Just pass the compat string and its size as
arguments.
Greg Kurz [Fri, 13 Dec 2019 12:00:18 +0000 (13:00 +0100)]
ppc/pnv: Pass XSCOM base address and address size to pnv_dt_xscom()
Since pnv_dt_xscom() is called from chip specific dt_populate() hooks,
it shouldn't have to guess the chip type in order to populate the "reg"
property. Just pass the base address and address size as arguments.
The pnv_chip_core_realize() function configures the XSCOM MMIO subregion
for each core of a single chip. The base address of the subregion depends
on the CPU type. Its computation is currently open-code using the
pnv_chip_is_powerXX() helpers. This can be achieved with QOM. Introduce
a method for this in the base chip class and implement it in child classes.
The pnv_pic_print_info() callback checks the type of the chip in order
to forward to the request appropriate interrupt controller. This can
be achieved with QOM. Introduce a method for this in the base chip class
and implement it in child classes.
This also prepares ground for the upcoming interrupt controller of POWER10
chips.
We add an extra node to advertise power management on some machines,
namely powernv9 and powernv10. This is achieved by using the
pnv_is_power9() and pnv_is_power10() helpers.
This can be achieved with QOM. Add a method to the base class for
powernv machines and have it implemented by machine types that
support power management instead.
Greg Kurz [Fri, 13 Dec 2019 11:59:50 +0000 (12:59 +0100)]
ppc/pnv: Introduce PnvMachineClass and PnvMachineClass::compat
The pnv_dt_create() function generates different contents for the
"compatible" property of the root node in the DT, depending on the
CPU type. This is open coded with multiple ifs using pnv_is_powerXX()
helpers.
It seems cleaner to achieve with QOM. Introduce a base class for the
powernv machine and a compat attribute that each child class can use
to provide the value for the "compatible" property.
Greg Kurz [Fri, 13 Dec 2019 11:59:39 +0000 (12:59 +0100)]
ppc/pnv: Introduce PnvPsiClass::compat
The Processor Service Interface (PSI) model has a chip_type class level
attribute, which is used to generate the content of the "compatible" DT
property according to the CPU type.
Since the PSI model already has specialized classes for each supported
CPU type, it seems cleaner to achieve this with QOM. Provide the content
of the "compatible" property with a new class level attribute.
We could define a "OCC common area" memory region at the machine level
and sub regions for each OCC. But it adds some extra complexity to the
models. Fix the current layout with a simpler model.
Cédric Le Goater [Wed, 11 Dec 2019 08:29:11 +0000 (09:29 +0100)]
ppc/pnv: Introduce PBA registers
The PBA bridge unit (Power Bus Access) connects the OCC (On Chip
Controller) to the Power bus and System Memory. The PBA is used to
gather sensor data, for power management, for sleep states, for
initial boot, among other things.
The PBA logic provides a set of four registers PowerBus Access Base
Address Registers (PBABAR0..3) which map the OCC address space to the
PowerBus space. These registers are setup by the initial FW and define
the PowerBus Range of system memory that can be accessed by PBA.
The current modeling of the PBABAR registers is done under the common
XSCOM handlers. We introduce a specific XSCOM regions for these
registers and fix :
- BAR sizes and BAR masks
- The mapping of the OCC common area. It is common to all chips and
should be mapped once. We will address per-OCC area in the next
change.
- OCC common area is in BAR 3 on P8
Greg Kurz [Wed, 11 Dec 2019 16:04:15 +0000 (17:04 +0100)]
ppc/pnv: Make PnvXScomInterface an incomplete type
PnvXScomInterface is an interface instance. It should never be
dereferenced. Drop the dummy type definition for extra safety,
which is the common practice with QOM interfaces.
While here also convert the bogus OBJECT_CHECK() to INTERFACE_CHECK().
Cédric Le Goater [Tue, 10 Dec 2019 13:58:45 +0000 (14:58 +0100)]
ppc/pnv: populate the DT with realized XSCOM devices
Some devices could be initialized in the instance_init handler but not
realized for configuration reasons. Nodes should not be added in the DT
for such devices.
The Access Segment Descriptor Register (ASDR) provides information about
the storage element when taking a hypervisor storage interrupt. When
performing nested radix address translation, this is normally the guest
real address. This register is present on POWER9 processors and later.
Implement the ADSR, note read and write access is limited to the
hypervisor.
target/ppc: Work [S]PURR implementation and add HV support
The Processor Utilisation of Resources Register (PURR) and Scaled
Processor Utilisation of Resources Register (SPURR) provide an estimate
of the resources used by the thread, present on POWER7 and later
processors.
Currently the [S]PURR registers simply count at the rate of the
timebase.
Preserve this behaviour but rework the implementation to store an offset
like the timebase rather than doing the calculation manually. Also allow
hypervisor write access to the register along with the currently
available read access.
The virtual timebase register (VTB) is a 64-bit register which
increments at the same rate as the timebase register, present on POWER8
and later processors.
The register is able to be read/written by the hypervisor and read by
the supervisor. All other accesses are illegal.
Currently the VTB is just an alias for the timebase (TB) register.
Implement the VTB so that is can be read/written independent of the TB.
Make use of the existing method for accessing timebase facilities where
by the compensation is stored and used to compute the value on reads/is
updated on writes.
This includes in QEMU a new CPU model for the POWER10 processor with
the same capabilities of a POWER9 process. The model will be extended
when support is completed.
Greg Kurz [Mon, 9 Dec 2019 13:28:00 +0000 (14:28 +0100)]
ppc: Make PPCVirtualHypervisor an incomplete type
PPCVirtualHypervisor is an interface instance. It should never be
dereferenced. Drop the dummy type definition for extra safety, which
is the common practice with QOM interfaces.
Greg Kurz [Wed, 4 Dec 2019 19:43:48 +0000 (20:43 +0100)]
ppc: Don't use CPUPPCState::irq_input_state with modern Book3s CPU models
The power7_set_irq() and power9_set_irq() functions set this but it is
never used actually. Modern Book3s compatible CPUs are only supported
by the pnv and spapr machines. They have an interrupt controller, XICS
for POWER7/8 and XIVE for POWER9, whose models don't require to track
IRQ input states at the CPU level.
Greg Kurz [Wed, 4 Dec 2019 19:43:37 +0000 (20:43 +0100)]
ppc: Deassert the external interrupt pin in KVM on reset
When a CPU is reset, QEMU makes sure no interrupt is pending by clearing
CPUPPCstate::pending_interrupts in ppc_cpu_reset(). In the case of a
complete machine emulation, eg. a sPAPR machine, an external interrupt
request could still be pending in KVM though, eg. an IPI. It will be
eventually presented to the guest, which is supposed to acknowledge it at
the interrupt controller. If the interrupt controller is emulated in QEMU,
either XICS or XIVE, ppc_set_irq() won't deassert the external interrupt
pin in KVM since it isn't pending anymore for QEMU. When the vCPU re-enters
the guest, the interrupt request is still pending and the vCPU will try
again to acknowledge it. This causes an infinite loop and eventually hangs
the guest.
The code has been broken since the beginning. The issue wasn't hit before
because accel=kvm,kernel-irqchip=off is an awkward setup that never got
used until recently with the LC92x IBM systems (aka, Boston).
Add a ppc_irq_reset() function to do the necessary cleanup, ie. deassert
the IRQ pins of the CPU in QEMU and most importantly the external interrupt
pin for this vCPU in KVM.
David Gibson [Fri, 29 Nov 2019 05:23:21 +0000 (16:23 +1100)]
spapr: Simplify ovec diff
spapr_ovec_diff(ov, old, new) has somewhat complex semantics. ov is set
to those bits which are in new but not old, and it returns as a boolean
whether or not there are any bits in old but not new.
It turns out that both callers only care about the second, not the first.
This is basically equivalent to a bitmap subset operation, which is easier
to understand and implement. So replace spapr_ovec_diff() with
spapr_ovec_subset().
David Gibson [Fri, 29 Nov 2019 04:00:58 +0000 (15:00 +1100)]
spapr: Fold h_cas_compose_response() into h_client_architecture_support()
spapr_h_cas_compose_response() handles the last piece of the PAPR feature
negotiation process invoked via the ibm,client-architecture-support OF
call. Its only caller is h_client_architecture_support() which handles
most of the rest of that process.
I believe it was placed in a separate file originally to handle some
fiddly dependencies between functions, but mostly it's just confusing
to have the CAS process split into two pieces like this. Now that
compose response is simplified (by just generating the whole device
tree anew), it's cleaner to just fold it into
h_client_architecture_support().
David Gibson [Fri, 18 Oct 2019 10:25:49 +0000 (21:25 +1100)]
spapr: Improve handling of fdt buffer size
Previously, spapr_build_fdt() constructed the device tree in a fixed
buffer of size FDT_MAX_SIZE. This is a bit inflexible, but more
importantly it's awkward for the case where we use it during CAS. In
that case the guest firmware supplies a buffer and we have to
awkwardly check that what we generated fits into it afterwards, after
doing a lot of size checks during spapr_build_fdt().
Simplify this by having spapr_build_fdt() take a 'space' parameter.
For the CAS case, we pass in the buffer size provided by SLOF, for the
machine init case, we continue to pass FDT_MAX_SIZE.
David Gibson [Fri, 18 Oct 2019 04:19:31 +0000 (15:19 +1100)]
spapr: Don't trigger a CAS reboot for XICS/XIVE mode changeover
PAPR allows the interrupt controller used on a POWER9 machine (XICS or
XIVE) to be selected by the guest operating system, by using the
ibm,client-architecture-support (CAS) feature negotiation call.
Currently, if the guest selects an interrupt controller different from the
one selected at initial boot, this causes the system to be reset with the
new model and the boot starts again. This means we run through the SLOF
boot process twice, as well as any other bootloader (e.g. grub) in use
before the OS calls CAS. This can be confusing and/or inconvenient for
users.
Thanks to two fairly recent changes, we no longer need this reboot. 1) we
now completely regenerate the device tree when CAS is called (meaning we
don't need special case updates for all the device tree changes caused by
the interrupt controller mode change), 2) we now have explicit code paths
to activate and deactivate the different interrupt controllers, rather than
just implicitly calling those at machine reset time.
We can therefore eliminate the reboot for changing irq mode, simply by
putting a call to spapr_irq_update_active_intc() before we call
spapr_h_cas_compose_response() (which gives the updated device tree to
the guest firmware and OS).
ppc: well form kvmppc_hint_smt_possible error hint helper
Make kvmppc_hint_smt_possible hint append helper well formed:
rename errp to errp_in, as it is IN-parameter here (which is unusual
for errp), rename function to be kvmppc_error_append_*_hint.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:20 +0000 (07:58 +0100)]
ppc/pnv: Dump the XIVE NVT table
This is useful to dump the saved contexts of the vCPUs : configuration
of the base END index of the vCPU and the Interrupt Pending Buffer
register, which is updated when an interrupt can not be presented.
When dumping the NVT table, we skip empty indirect pages which are not
necessarily allocated.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:18 +0000 (07:58 +0100)]
ppc/pnv: Introduce a pnv_xive_block_id() helper
When PC_TCTXT_CHIPID_OVERRIDE is configured, the PC_TCTXT_CHIPID field
overrides the hardwired chip ID in the Powerbus operations and for CAM
compares. This is typically used in the one block-per-chip configuration
to associate a unique block id number to each IC of the system.
Simplify the model with a pnv_xive_block_id() helper and remove
'tctx_chipid' which becomes useless.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:17 +0000 (07:58 +0100)]
ppc/xive: Synthesize interrupt from the saved IPB in the NVT
When a vCPU is dispatched on a HW thread, its context is pushed in the
thread registers and it is activated by setting the VO bit in the CAM
line word2. The HW grabs the associated NVT, pulls the IPB bits and
merges them with the IPB of the new context. If interrupts were missed
while the vCPU was not dispatched, these are synthesized in this
sequence.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:14 +0000 (07:58 +0100)]
ppc/xive: Move the TIMA operations to the controller model
On the P9 Processor, the thread interrupt context registers of a CPU
can be accessed "directly" when by load/store from the CPU or
"indirectly" by the IC through an indirect TIMA page. This requires to
configure first the PC_TCTXT_INDIRx registers.
Today, we rely on the get_tctx() handler to deduce from the CPU PIR
the chip from which the TIMA access is being done. By handling the
TIMA memory ops under the interrupt controller model of each machine,
we can uniformize the TIMA direct and indirect ops under PowerNV. We
can also check that the CPUs have been enabled in the XIVE controller.
This prepares ground for the future versions of XIVE.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:13 +0000 (07:58 +0100)]
ppc/pnv: Clarify how the TIMA is accessed on a multichip system
The TIMA region gives access to the thread interrupt context registers
of a CPU. It is mapped at the same address on all chips and can be
accessed by any CPU of the system. To identify the chip from which the
access is being done, the PowerBUS uses a 'chip' field in the
load/store messages. QEMU does not model these messages, instead, we
extract the chip id from the CPU PIR and do a lookup at the machine
level to fetch the targeted interrupt controller.
Introduce pnv_get_chip() and pnv_xive_tm_get_xive() helpers to clarify
this process in pnv_xive_get_tctx(). The latter will be removed in the
subsequent patches but the same principle will be kept.
Greg Kurz [Tue, 26 Nov 2019 16:46:33 +0000 (17:46 +0100)]
spapr/xive: Configure number of servers in KVM
The XIVE KVM devices now has an attribute to configure the number of
interrupt servers. This allows to greatly optimize the usage of the VP
space in the XIVE HW, and thus to start a lot more VMs.
Only set this attribute if available in order to support older POWER9
KVM.
The XIVE KVM device now reports the exhaustion of VPs upon the
connection of the first VCPU. Check that in order to have a chance
to provide a hint to the user.
Greg Kurz [Tue, 26 Nov 2019 16:46:28 +0000 (17:46 +0100)]
spapr/xics: Configure number of servers in KVM
The XICS-on-XIVE KVM devices now has an attribute to configure the number
of interrupt servers. This allows to greatly optimize the usage of the VP
space in the XIVE HW, and thus to start a lot more VMs.
Only set this attribute if available in order to support older POWER9 KVM
and pre-POWER9 XICS KVM devices.
Greg Kurz [Tue, 26 Nov 2019 16:46:23 +0000 (17:46 +0100)]
spapr: Pass the maximum number of vCPUs to the KVM interrupt controller
The XIVE and XICS-on-XIVE KVM devices on POWER9 hosts can greatly reduce
their consumption of some scarce HW resources, namely Virtual Presenter
identifiers, if they know the maximum number of vCPUs that may run in the
VM.
Prepare ground for this by passing the value down to xics_kvm_connect()
and kvmppc_xive_connect(). This is purely mechanical, no functional
change.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:12 +0000 (07:58 +0100)]
ppc/xive: Extend the TIMA operation with a XivePresenter parameter
The TIMA operations are performed on behalf of the XIVE IVPE sub-engine
(Presenter) on the thread interrupt context registers. The current
operations supported by the model are simple and do not require access
to the controller but more complex operations will need access to the
controller NVT table and to its configuration.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:11 +0000 (07:58 +0100)]
ppc/xive: Use the XiveFabric and XivePresenter interfaces
Now that the machines have handlers implementing the XiveFabric and
XivePresenter interfaces, remove xive_presenter_match() and make use
of the 'match_nvt' handler of the machine.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:10 +0000 (07:58 +0100)]
ppc/spapr: Implement the XiveFabric interface
The CAM line matching sequence in the pseries machine does not change
much apart from the use of the new QOM interfaces. There is an extra
indirection because of the sPAPR IRQ backend of the machine. Only the
XIVE backend implements the new 'match_nvt' handler.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:08 +0000 (07:58 +0100)]
ppc/xive: Introduce a XiveFabric interface
The XiveFabric QOM interface acts as the PowerBUS interface between
the interrupt controller and the system and should be implemented by
the QEMU machine. On HW, the XIVE sub-engine is responsible for the
communication with the other chip is the Common Queue (CQ) bridge
unit.
This interface offers a 'match_nvt' handler to perform the CAM line
matching when looking for a XIVE Presenter with a dispatched NVT.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:07 +0000 (07:58 +0100)]
ppc/pnv: Fix TIMA indirect access
When the TIMA of a CPU needs to be accessed from the indirect page,
the thread id of the target CPU is first stored in the PC_TCTXT_INDIR0
register. This thread id is relative to the chip and not to the system.
Introduce a helper routine to look for a CPU of a given PIR and fix
pnv_xive_get_indirect_tctx() to scan only the threads of the local
chip and not the whole machine.
Greg Kurz [Mon, 25 Nov 2019 06:58:03 +0000 (07:58 +0100)]
ppc/pnv: Instantiate cores separately
Allocating a big void * array to store multiple objects isn't a
recommended practice for various reasons:
- no compile time type checking
- potential dangling pointers if a reference on an individual is
taken and the array is freed later on
- duplicate boiler plate everywhere the array is browsed through
Allocate an array of pointers and populate it instead.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:02 +0000 (07:58 +0100)]
ppc/xive: Implement the XivePresenter interface
Each XIVE Router model, sPAPR and PowerNV, now implements the 'match_nvt'
handler of the XivePresenter QOM interface. This is simply moving code
and taking into account the new API.
To be noted that the xive_router_get_tctx() helper is not used anymore
when doing CAM matching and will be removed later on after other changes.
The XIVE presenter model is still too simple for the PowerNV machine
and the CAM matching algo is not correct on multichip system. Subsequent
patches will introduce more changes to scan all chips of the system.
Cédric Le Goater [Mon, 25 Nov 2019 06:58:01 +0000 (07:58 +0100)]
ppc/xive: Introduce a XivePresenter interface
When the XIVE IVRE sub-engine (XiveRouter) looks for a Notification
Virtual Target (NVT) to notify, it broadcasts a message on the
PowerBUS to find an XIVE IVPE sub-engine (Presenter) with the NVT
dispatched on one of its HW threads, and then forwards the
notification if any response was received.
The current XIVE presenter model is sufficient for the pseries machine
because it has a single interrupt controller device, but the PowerNV
machine can have multiple chips each having its own interrupt
controller. In this case, the XIVE presenter model is too simple and
the CAM line matching should scan all chips of the system.
To start fixing this issue, we first extend the XIVE Router model with
a new XivePresenter QOM interface representing the XIVE IVPE
sub-engine. This interface exposes a 'match_nvt' handler which the
sPAPR and PowerNV XIVE Router models will need to implement to perform
the CAM line matching.
Cédric Le Goater [Thu, 21 Nov 2019 16:23:40 +0000 (17:23 +0100)]
ppc/pnv: Create BMC devices at machine init
The BMC of the OpenPOWER systems monitors the machine state using
sensors, controls the power and controls the access to the PNOR flash
device containing the firmware image required to boot the host.
QEMU models the power cycle process, access to the sensors and access
to the PNOR device. But, for these features to be available, the QEMU
PowerNV machine needs two extras devices on the command line, an IPMI
BT device for communication and a BMC backend device:
The BMC properties are then defined accordingly in the device tree and
OPAL self adapts. If a BMC device and an IPMI BT device are not
available, OPAL does not try to communicate with the BMC in any
manner. This is not how real systems behave.
To be closer to the default behavior, create an IPMI BMC simulator
device and an IPMI BT device at machine initialization time. We loose
the ability to define an external BMC device but there are benefits:
- a better match with real systems,
- a better test coverage of the OPAL code,
- system powerdown and reset commands that work,
- a QEMU device tree compliant with the specifications (*).
Cédric Le Goater [Mon, 28 Oct 2019 07:00:27 +0000 (08:00 +0100)]
ppc/pnv: Add HIOMAP commands
This activates HIOMAP support on the QEMU PowerNV machine. The PnvPnor
model is used to access the flash contents. The model simply maps the
contents at a fix offset and enables or disables the mapping.
Cédric Le Goater [Fri, 15 Nov 2019 16:24:19 +0000 (17:24 +0100)]
ppc/xive: Introduce OS CAM line helpers
The OS CAM line has a special encoding exploited by the HW. Provide
helper routines to hide the details to the TIMA command handlers. This
also clarifies the endianness of different variables : 'qw1w2' is
big-endian and 'cam' is native.
Greg Kurz [Mon, 18 Nov 2019 15:12:07 +0000 (16:12 +0100)]
xive/kvm: Trigger interrupts from userspace
When using the XIVE KVM device, the trigger page is directly accessible
in QEMU. Unlike with XICS, no need to ask KVM to fire the interrupt. A
simple store on the trigger page does the job.
Just call xive_esb_trigger().
This may improve performance of emulated devices that go through
qemu_set_irq(), eg. virtio devices created with ioeventfd=off or
configured by the guest to use LSI interrupts, which aren't really
recommended setups.
Cédric Le Goater [Fri, 15 Nov 2019 16:24:16 +0000 (17:24 +0100)]
ppc/pnv: Remove pnv_xive_vst_size() routine
pnv_xive_vst_size() tries to compute the size of a VSD table from the
information given by FW. The number of entries of the table are
deduced from the result and the MMIO regions of the ESBs and the END
ESBs are then resized accordingly with the computed value. This
reduces the number of elements that can be addressed by the ESB pages.
The maximum number of elements of a direct table can contain is simply:
Table size / sizeof(XIVE structure)
An indirect table is a one page array of VSDs pointing to subpages
containing XIVE virtual structures and the maximum number of elements
an indirect table can contain :
which gives us 16M for XiveENDs, 8M for XiveNVTs. That's more than the
associated VC and PC BARS can address.
The result returned by pnv_xive_vst_size() for indirect tables is
incorrect and can not be used to reduce the size of the MMIO region of
a XIVE resource using an indirect table, such as ENDs in skiboot.
Remove pnv_xive_vst_size() and use a simpler form for direct tables
only. Keep the resizing of the MMIO region for direct tables only as
this is still useful for the ESB MMIO window.
Cédric Le Goater [Fri, 15 Nov 2019 16:24:15 +0000 (17:24 +0100)]
ppc/xive: Introduce helpers for the NVT id
Each vCPU in the system is identified with an NVT identifier which is
pushed in the OS CAM line (QW1W2) of the HW thread interrupt context
register when the vCPU is dispatched on a HW thread. This identifier
is used by the presenter subengine to find a matching target to notify
of an event. It is also used to fetch the associate NVT structure
which may contain pending interrupts that need a resend.
Add a couple of helpers for the NVT ids. The NVT space is 19 bits
wide, giving a maximum of 512K per chip.
Cédric Le Goater [Fri, 15 Nov 2019 16:24:14 +0000 (17:24 +0100)]
ppc/xive: Record the IPB in the associated NVT
When an interrupt can not be presented to a vCPU, because it is not
running on any of the HW treads, the XIVE presenter updates the
Interrupt Pending Buffer register of the associated XIVE NVT
structure. This is only done if backlog is activated in the END but
this is generally the case.
The current code assumes that the fields of the NVT structure is
architected with the same layout of the thread interrupt context
registers. Fix this assumption and define an offset for the IPB
register backup value in the NVT.
Greg Kurz [Sun, 17 Nov 2019 23:20:52 +0000 (00:20 +0100)]
spapr: Abort if XICS interrupt controller cannot be initialized
Failing to set any of the ICS property should really never happen:
- object_property_add_child() always succeed unless the child object
already has a parent, which isn't the case here obviously since the
ICS has just been created with object_new()
- the ICS has an "nr-irqs" property than can be set as long as the ICS
isn't realized
In both cases, an error indicates there is a bug in QEMU. Propagating the
error, ie. exiting QEMU since spapr_irq_init() is called with &error_fatal
doesn't make much sense. Abort instead. This is consistent with what is
done with XIVE : both qdev_create() and qdev_prop_set_uint32() abort QEMU
on error.
Greg Kurz [Sun, 17 Nov 2019 23:20:47 +0000 (00:20 +0100)]
xics: Link ICP_PROP_CPU property to ICPState::cs pointer
The ICP object has both a pointer and an ICP_PROP_CPU property pointing
to the cpu. Confusing bugs could arise if these ever go out of sync.
Change the property definition so that it explicitly sets the pointer.
The property isn't optional : not being able to set the link is a bug
and QEMU should rather abort than exit in this case.
Greg Kurz [Sun, 17 Nov 2019 23:20:41 +0000 (00:20 +0100)]
xics: Link ICP_PROP_XICS property to ICPState::xics pointer
The ICP object has both a pointer and an ICP_PROP_XICS property pointing
to the XICS fabric. Confusing bugs could arise if these ever go out of
sync.
Change the property definition so that it explicitly sets the pointer.
The property isn't optional : not being able to set the link is a bug
and QEMU should rather abort than exit in this case.
Greg Kurz [Sun, 17 Nov 2019 23:20:36 +0000 (00:20 +0100)]
xics: Link ICS_PROP_XICS property to ICSState::xics pointer
The ICS object has both a pointer and an ICS_PROP_XICS property pointing
to the XICS fabric. Confusing bugs could arise if these ever go out of
sync.
Change the property definition so that it explicitely sets the pointer.
The property isn't optional : not being able to set the link is a bug
and QEMU should rather abort than exit in this case.