             exec time (s)  Relative slowdown wrt original (%)
---------------------------------------------------------------
original     20.213321616   0.
tcg_malloc   20.441130078   1.1270214
TCGContext   20.477846517   1.3086662
g_malloc     20.780527895   2.8061013
The other two alternatives shown in the table are:
- TCGContext: embed temps[TCG_MAX_TEMPS] and TCGTempSet used_temps
in TCGContext. This is simple enough but it isn't faster than using
tcg_malloc; moreover, it wastes memory.
- g_malloc: allocate/deallocate both temps and used_temps every time
tcg_optimize is executed.
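For illustration, the winning tcg_malloc approach looks roughly like
this (a sketch against the sources of this period; tcg_temp_info is the
optimizer's per-temp metadata struct). tcg_malloc() hands out memory
from the per-context pool, which is reset wholesale after each
translation, so there is no matching free and no g_malloc/g_free pair
per tcg_optimize run:

    void tcg_optimize(TCGContext *s)
    {
        int nb_temps = s->nb_temps;
        struct tcg_temp_info *infos;
        TCGTempSet temps_used;

        /* one pool allocation per run, released when the pool is reset */
        infos = tcg_malloc(sizeof(*infos) * nb_temps);
        memset(&temps_used, 0, sizeof(temps_used));

        /* ... constant folding and copy propagation over the op stream ... */
    }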
tcg: distribute profiling counters across TCGContext's
This is groundwork for supporting multiple TCG contexts.
To avoid scalability issues when profiling info is enabled, this patch
makes the profiling info counters distributed via the following changes:
1) Consolidate profile info into its own struct, TCGProfile, which
TCGContext also includes. Note that tcg_table_op_count is brought
into TCGProfile after dropping the tcg_ prefix.
2) Iterate over the TCG contexts in the system to obtain the total counts.
This change also requires updating the accessors to TCGProfile fields to
use atomic_read/set whenever there may be conflicting accesses (as defined
in C11) to them.
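A sketch of what reading a counter back then looks like (the helper
name is hypothetical; tcg_ctxs/n_tcg_ctxs come from the patch just
below). Each vCPU thread only ever updates the TCGProfile embedded in
its own TCGContext, so writers never contend on a cache line; readers
walk all contexts and sum:

    static int64_t tcg_total_tb_count(void)
    {
        unsigned int n_ctxs = atomic_read(&n_tcg_ctxs);
        int64_t total = 0;
        unsigned int i;

        for (i = 0; i < n_ctxs; i++) {
            const TCGContext *s = atomic_read(&tcg_ctxs[i]);

            /* atomic_read: the owning thread may update concurrently */
            total += atomic_read(&s->prof.tb_count);
        }
        return total;
    }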
tcg: introduce **tcg_ctxs to keep track of all TCGContext's
Groundwork for supporting multiple TCG contexts.
Note that having n_tcg_ctxs is unnecessary. However, it is
convenient to have it, since it will simplify iterating over the
array: we'll have just a for loop instead of having to iterate
over a NULL-terminated array (which would require n+1 elems)
or having to check with ifdef's for usermode/softmmu.
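The bookkeeping is then simply:

    /* all TCG contexts in the system, plus an explicit element count */
    static TCGContext **tcg_ctxs;
    static unsigned int n_tcg_ctxs;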
Emilio G. Cota [Sat, 24 Jun 2017 00:57:44 +0000 (20:57 -0400)]
translate-all: report correct avg host TB size
Since commit 6e3b2bfd6 ("tcg: allocate TB structs before the
corresponding translated code") we are not fully utilizing
code_gen_buffer for translated code, and therefore are
incorrectly reporting the amount of translated code as well as
the average host TB size. Address this by:
- Making the conscious choice of misreporting the total translated code;
doing otherwise would mislead users into thinking "-tb-size" is not
honoured.
- Expanding tb_tree_stats to accurately count the bytes of translated code on
the host, and using this for reporting the average tb host size,
as well as the expansion ratio.
In the future we might want to consider reporting the accurate numbers for
the total translated code, together with a "bookkeeping/overhead" field to
account for the TB structs.
Emilio G. Cota [Fri, 23 Jun 2017 23:00:11 +0000 (19:00 -0400)]
translate-all: use a binary search tree to track TBs in TBContext
This is a prerequisite for supporting multiple TCG contexts, since
we will have threads generating code in separate regions of
code_gen_buffer.
For this we need a new field (.size) in struct tb_tc to keep
track of the size of the translated code. This field uses a size_t
to avoid adding a hole to the struct, although really an unsigned
int would have been enough.
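The resulting struct (per the description above; a zero .size is what
later distinguishes lookup keys from existing TBs):

    struct tb_tc {
        void *ptr;    /* pointer to the translated code */
        size_t size;  /* size_t rather than unsigned int: avoids a hole */
    };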
The comparison function we use is optimized for the common case:
insertions. Profiling shows that upon booting debian-arm, 98%
of comparisons are between existing tb's (i.e. a->size and b->size
are both !0), which happens during insertions (and removals, but
those are rare). The remaining cases are lookups. From reading the glib
sources we see that the first key is always the lookup key; however,
the code does not assume this to always be the case, because this
behaviour is not guaranteed in the glib docs. We do, however, embed
this knowledge in the code as a branch hint for the compiler, as in
the sketch below.
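A comparison function along these lines (a sketch, not necessarily the
exact code; likely() is the usual branch-hint macro):

    static gint tb_tc_cmp(gconstpointer ap, gconstpointer bp)
    {
        const struct tb_tc *a = ap;
        const struct tb_tc *b = bp;

        /* common case: both TBs exist, i.e. insertion (or rare removal) */
        if (likely(a->size && b->size)) {
            if ((uintptr_t)a->ptr > (uintptr_t)b->ptr) {
                return 1;
            } else if ((uintptr_t)a->ptr < (uintptr_t)b->ptr) {
                return -1;
            }
            return 0;
        }
        /* lookup: hint that the first key is the lookup key */
        if (likely(a->size == 0)) {
            /* does b's range [ptr, ptr + size) contain a->ptr? */
            if ((uintptr_t)a->ptr < (uintptr_t)b->ptr) {
                return -1;
            }
            return (uintptr_t)a->ptr < (uintptr_t)b->ptr + b->size ? 0 : 1;
        }
        /* symmetric case: b is the lookup key */
        if ((uintptr_t)b->ptr < (uintptr_t)a->ptr) {
            return 1;
        }
        return (uintptr_t)b->ptr < (uintptr_t)a->ptr + a->size ? 0 : -1;
    }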
Note that tb_free does not free space in the code_gen_buffer anymore,
since we cannot easily know whether the tb is the last one inserted
in code_gen_buffer. The next patch in this series renames tb_free
to tb_remove to reflect this.
Performance-wise, lookups in tb_find_pc are the same as before:
O(log n). However, insertions are O(log n) instead of O(1), which
results in a small slowdown when booting debian-arm:
cpu-exec: lookup/generate TB outside exclusive region during step_atomic
Now that all code generation has been converted to check CF_PARALLEL, we can
generate !CF_PARALLEL code without having yet set !parallel_cpus --
and therefore without having to be in the exclusive region during
cpu_exec_step_atomic.
While at it, merge cpu_exec_step into cpu_exec_step_atomic.
We were generating code during tb_invalidate_phys_page_range,
check_watchpoint, cpu_io_recompile, and (seemingly) discarding
the TB, assuming that it would magically be picked up during
the next iteration through the cpu_exec loop.
Instead, record the desired cflags in CPUState so that we request
the proper TB and there is no more magic.
tcg: define CF_PARALLEL and use it for TB hashing along with CF_COUNT_MASK
This will enable us to decouple code translation from the value
of parallel_cpus at any given time. It will also help us minimize
TB flushes when generating code via EXCP_ATOMIC.
Note that the declaration of parallel_cpus is brought to exec-all.h
to be able to define there the "curr_cflags" inline.
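The curr_cflags inline is then roughly:

    static inline uint32_t curr_cflags(void)
    {
        return parallel_cpus ? CF_PARALLEL : 0;
    }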
Using the offset of a temporary, relative to TCGContext, rather than
its index means that we don't use 0. That leaves offset 0 free for
a NULL representation without having to leave index 0 unused.
The GET and MAKE functions weren't really specific enough.
We now have a full complement of functions that convert exactly
between temporaries, arguments, tcgv pointers, and indices.
The target/sparc change is also a bug fix, which would have affected
a host that defines TCG_TARGET_HAS_extr[lh]_i64_i32, i.e. MIPS64.
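A sketch of one such conversion pair (assuming the single global
tcg_ctx of this period): a TCGv_i32 carries the byte offset of its
TCGTemp within TCGContext, so offset 0 never names a temp and is free
to act as NULL:

    static inline TCGTemp *tcgv_i32_temp(TCGv_i32 v)
    {
        uintptr_t o = (uintptr_t)v;   /* offset relative to tcg_ctx */

        return (TCGTemp *)((uintptr_t)&tcg_ctx + o);
    }

    static inline TCGv_i32 temp_tcgv_i32(TCGTemp *t)
    {
        return (TCGv_i32)((uintptr_t)t - (uintptr_t)&tcg_ctx);
    }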
Copy s->nb_globals or s->nb_temps to a local variable for the purposes
of iteration. This should allow the compiler to use low-overhead
looping constructs on some hosts.
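For example (hypothetical loop body): with the bound in a local, the
compiler can prove it loop-invariant and emit a counted loop instead of
reloading s->nb_temps on every iteration:

    static void visit_all_temps(TCGContext *s)
    {
        int nb_temps = s->nb_temps;   /* local copy for iteration */
        int i;

        for (i = 0; i < nb_temps; i++) {
            TCGTemp *ts = &s->temps[i];

            (void)ts;   /* ... per-temp processing ... */
        }
    }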
Rather than have a separate buffer of 10*max_ops entries,
give each opcode 10 entries. The result is actually a bit
smaller and should have slightly more cache locality.
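That is, roughly (surrounding fields elided; MAX_OPC_PARAM is the 10
mentioned above):

    typedef struct TCGOp {
        TCGOpcode opc;
        /* ... list links and lifetime data ... */
        TCGArg args[MAX_OPC_PARAM];   /* 10 entries inline per op */
    } TCGOp;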
* remotes/kraxel/tags/fixes-20171023-pull-request:
scripts: don't throw away stderr when checking out git submodules
ui: add qemu-keymap and shader to .gitignore
configure: disable qemu-keymap for linux-user qemu
# gpg: Signature made Fri 20 Oct 2017 22:51:16 BST
# gpg: using RSA key 0xC3B31C2D5E6627E4
# gpg: Good signature from "Stafford Horne <[email protected]>"
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: D9C4 7354 AEF8 6C10 3A25 EFF1 C3B3 1C2D 5E66 27E4
* remotes/shorne/tags/openrisc-20171021-smp-pr:
openrisc: Only kick cpu on timeout, not on update
openrisc: Initial SMP support
openrisc/cputimer: Preparation for Multicore
target/openrisc: Make coreid and numcores variable
openrisc/ompic: Add OpenRISC Multicore PIC (OMPIC)
Laurent Vivier [Thu, 19 Oct 2017 19:16:06 +0000 (21:16 +0200)]
configure: disable qemu-keymap for linux-user qemu
We don't need qemu-keymap when we build only linux-user qemu.
When we build statically, libxkbcommon is detected by configure
if the shared library is available, but it cannot be linked if the
static version is not available.
As we don't need it for qemu-linux-user, and we generally need
a static link to use it in a chroot, disable qemu-keymap in
this case.
Previously we were kicking the cpu on every update. This caused
problems noticeable in SMP configurations where one CPU got pinned
continuously servicing timer exceptions.
Stafford Horne [Fri, 20 Oct 2017 21:36:58 +0000 (06:36 +0900)]
openrisc: Initial SMP support
Wire in ompic and add basic support for SMP. The OpenRISC is special in
that interrupts for devices are routed to each core's PIC. This is
achieved using the qemu_irq_split utility, but this currently limits
OpenRISC to 2 cores.
This models the reference architecture described in the OpenRISC spec
1.2 proposal.
The changes to the initialization of the sim include:
CPU Reset
o Reset each cpu to the bootstrap PC rather than only a single cpu as
done before.
o During Kernel loading the bootstrap PC is saved in a static global.
Network Initialization
o Connect the interrupt to each CPU
o Use the simpler sysbus_mmio_map() rather than memory_region_add_subregion()
Sim Initialization
o Initialize the pic and tick timer per cpu
o Wire in the OMPIC if SMP is enabled
o Wire the serial irq to each CPU using qemu_irq_split()
Stafford Horne [Mon, 21 Aug 2017 21:37:10 +0000 (06:37 +0900)]
openrisc/cputimer: Preparation for Multicore
In order to support multicore system we move some of the previously
static state variables into the state of each core.
On the other hand, in order to allow timers to be synced between each
core, the ttcr (tick timer count register) is moved out of the core.
This does not match the real hardware spec, which has a separate timer
counter per core, but it seems the simplest way to keep each clock in sync.
Add the OpenRISC Multicore PIC, which handles inter-processor interrupts
(IPI) between cores. In OpenRISC all device interrupts are routed to
each core, which allows this device to stay simple.
Peter Maydell [Fri, 20 Oct 2017 12:33:32 +0000 (13:33 +0100)]
Merge remote-tracking branch 'remotes/cohuck/tags/s390x-20171020' into staging
The last big chunk of s390x changes:
- (experimental) smp support under tcg
- provide the virtio-input devices for virtio-ccw
- improve error handling in the css code
- enable some simple virtio tests for s390x
- low-address protection in tcg
- some more cleanups and fixes
* remotes/cohuck/tags/s390x-20171020: (46 commits)
s390x/tcg: low-address protection support
accel/tcg: allow to invalidate a write TLB entry immediately
tests: Enable the very simple virtio tests on s390x, too
libqtest: Add qtest_[v]startf()
s390x: refactor error handling for MSCH handler
s390x: refactor error handling for HSCH handler
s390x: refactor error handling for CSCH handler
s390x: refactor error handling for XSCH handler
s390x: improve error handling for SSCH and RSCH
s390x/css: IO instr handler ending control
s390x: move s390x_new_cpu() into board code
s390x: fix cpu object reference leak in s390x_new_cpu()
s390x/event-facility: variable-length event masks
s390x/MAINTAINERS: add mailing list
virtio-ccw: Add the virtio-input devices for CCW bus
target/s390x: special handling when starting a CPU with WAIT PSW
s390x/tcg: refactor stfl(e) to use s390_get_feat_block()
s390x/tcg: unlock NMI
s390x/cpumodel: allow to enable SENSE RUNNING STATUS for qemu
s390x/tcg: switch to new SIGP handling code
...
Peter Maydell [Fri, 20 Oct 2017 11:45:56 +0000 (12:45 +0100)]
Merge remote-tracking branch 'remotes/famz/tags/docker-pull-request' into staging
# gpg: Signature made Fri 20 Oct 2017 07:30:45 BST
# gpg: using RSA key 0xCA35624C6A9171C6
# gpg: Good signature from "Fam Zheng <[email protected]>"
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: 5003 7CB7 9706 0F76 F021 AD56 CA35 624C 6A91 71C6
* remotes/famz/tags/docker-pull-request:
docker: Fix PATH for ccache
docker: fix out-of-tree 'make docker-test-build@debian-powerpc-cross'
docker: allow running from srcdir != builddir build
docker: cleanup temp directory after test
docker: Don't allocate tty unless DEBUG=1
This is a neat way to implement low address protection, whereby
only the first 512 bytes of the first two pages (each 4096 bytes) of
every address space are protected.
Store a tec of 0 for the access exception; this is what is defined by
Enhanced Suppression on Protection in case of a low address protection
(bit 61 set to 0, rest undefined).
We have to make sure to pass the access address, not the masked page
address, into mmu_translate*().
Drop the check from testblock, so that we can properly test this via
kvm-unit-tests.
This will check every access going through one of the MMUs.
accel/tcg: allow to invalidate a write TLB entry immediately
Background: s390x implements Low-Address Protection (LAP). If LAP is
enabled, writing to effective addresses (before any translation)
0-511 and 4096-4607 triggers a protection exception.
So we have subpage protection on the first two pages of every address
space (where the lowcore, the CPU's private data, resides).
By immediately invalidating the write entry but allowing the caller to
continue, we force every write access to these first two pages into
the slow path. We will get a TLB fault with the specific accessed
addresses and can then evaluate whether protection applies or not.
We have to make sure to ignore the invalid bit if tlb_fill() succeeds.
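The core of the mechanism is small (a sketch; the hypothetical helper
below wraps what the patch does inside tlb_set_page_with_attrs() when
the target passes PAGE_WRITE_INV along with PAGE_WRITE):

    static void tlb_mark_write_invalid(CPUTLBEntry *te, int prot)
    {
        if (prot & PAGE_WRITE_INV) {
            /* the next write to this page faults back into tlb_fill() */
            te->addr_write |= TLB_INVALID_MASK;
        }
    }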
Eric Blake [Wed, 18 Oct 2017 14:20:27 +0000 (16:20 +0200)]
libqtest: Add qtest_[v]startf()
We have several callers that were formatting the argument strings
themselves; consolidate this effort by adding new convenience
functions directly in libqtest, and update some call-sites that
can benefit from it.
Note that the new functions qtest_startf() and qtest_vstartf()
behave more like qtest_init() (the caller must assign global_qtest
after the fact, rather than getting it implicitly set). This helps
us prepare for future patches that get rid of the global variable,
by explicitly highlighting which tests still depend on it now.
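A hypothetical caller then looks like:

    static void test_example(void)
    {
        QTestState *qts = qtest_startf("-machine %s -nodefaults", "virt");

        /* unlike qtest_start(), we must assign the global ourselves */
        global_qtest = qts;
        /* ... exercise the device under test ... */
        qtest_quit(qts);
    }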
Halil Pasic [Tue, 17 Oct 2017 14:04:53 +0000 (16:04 +0200)]
s390x: refactor error handling for MSCH handler
Simplify the error handling of the MSCH handler. Let the code detecting the
condition tell (in a less ambiguous way) how it's to be handled. No
changes in behavior.
Halil Pasic [Tue, 17 Oct 2017 14:04:52 +0000 (16:04 +0200)]
s390x: refactor error handling for HSCH handler
Simplify the error handling of the HSCH handler. Let the code detecting the
condition tell (in a less ambiguous way) how it's to be handled. No
changes in behavior.
Halil Pasic [Tue, 17 Oct 2017 14:04:51 +0000 (16:04 +0200)]
s390x: refactor error handling for CSCH handler
Simplify the error handling of the CSCH handler. Let the code detecting the
condition tell (in a less ambiguous way) how it's to be handled. No
changes in behavior.
Halil Pasic [Tue, 17 Oct 2017 14:04:50 +0000 (16:04 +0200)]
s390x: refactor error handling for XSCH handler
Simplify the error handling of the XSCH handler. Let the code detecting the
condition tell (in a less ambiguous way) how it's to be handled. No
changes in behavior.
Halil Pasic [Tue, 17 Oct 2017 14:04:49 +0000 (16:04 +0200)]
s390x: improve error handling for SSCH and RSCH
Simplify the error handling of the SSCH and RSCH handlers, avoiding
arbitrary and cryptic error codes being used to tell how the instruction
is supposed to end. Let the code detecting the condition tell how it's
to be handled in a less ambiguous way. It's best to handle SSCH and RSCH
in one go as the emulation of the two shares a lot of code.
For passthrough this change isn't pure refactoring, but changes the way
kernel reported EFAULT is handled. After clarifying the kernel interface
we decided that EFAULT shall be mapped to unit exception. Same goes for
unexpected error codes and absence of required ORB flags.
Halil Pasic [Tue, 17 Oct 2017 14:04:48 +0000 (16:04 +0200)]
s390x/css: IO instr handler ending control
CSS code needs to tell the IO instruction handlers located in ioinst.c
how the emulated instruction should be ended. Currently this is done by
returning generic (POSIX) error codes, and mapping them to outcomes like
condition codes. This makes bugs easy to create and hard to recognize.
As a preparation for moving away from (mis)using generic error codes for
flow control let us introduce a type which tells the instruction
handler function how to end the instruction, in a more straightforward
and less ambiguous way.
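A sketch of such a type (enumerator names assumed): each value maps
directly onto how the handler must end the instruction, instead of
overloading POSIX error codes:

    typedef enum IOInstEnding {
        IOINST_CC_EXPECTED = 0,        /* set condition code 0 */
        IOINST_CC_STATUS_PRESENT = 1,  /* set condition code 1 */
        IOINST_CC_BUSY = 2,            /* set condition code 2 */
        IOINST_CC_NOT_OPERATIONAL = 3, /* set condition code 3 */
        IOINST_PGM_CHECK = 4,          /* inject a program check */
    } IOInstEnding;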
Igor Mammedov [Tue, 17 Oct 2017 13:41:19 +0000 (15:41 +0200)]
s390x: fix cpu object reference leak in s390x_new_cpu()
object_new() returns cpu with refcnt == 1 and after realize
refcnt == 2*. s390x_new_cpu(), as the owner of the first reference,
should have released it on exit in both cases (on error and on
success) to avoid a leak. Do so.
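The shape of the fix (a sketch per the message above; property names
assumed):

    S390CPU *s390x_new_cpu(const char *typename, uint32_t core_id,
                           Error **errp)
    {
        S390CPU *cpu = S390_CPU(object_new(typename));  /* refcnt == 1 */
        Error *err = NULL;

        object_property_set_int(OBJECT(cpu), core_id, "core-id", &err);
        if (err) {
            goto out;
        }
        object_property_set_bool(OBJECT(cpu), true, "realized", &err);

    out:
        /* drop our reference; a realized cpu keeps its own */
        object_unref(OBJECT(cpu));
        if (err) {
            error_propagate(errp, err);
            cpu = NULL;
        }
        return cpu;
    }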
Cornelia Huck [Wed, 11 Oct 2017 13:39:53 +0000 (09:39 -0400)]
s390x/event-facility: variable-length event masks
The architecture supports masks of variable length for sclp write
event mask. We currently only support 4 byte event masks, as that
is what Linux uses.
Let's extend this to the maximum mask length supported by the
architecture and return 0 to the guest for the mask bits we don't
support in core.
target/s390x: special handling when starting a CPU with WAIT PSW
When we try to start a CPU with a WAIT PSW, we have to take care that
TCG will actually try to continue executing instructions.
We must therefore really only unhalt the CPU if we don't have a WAIT
PSW. Also document the special order for restart interrupts, which
load a new PSW and change the state to operating.
To keep KVM working, simply don't have a look at the WAIT bit when
loading the PSW. Otherwise the behavior of a restart interrupt when
a CPU is stopped would be changed.
This effectively enables experimental SMP support. Floating interrupts are
still a mess, so allow it but print a big warning. There also seems
to be a problem with CPU hotplug (after the main loop started).
s390x/tcg: implement STOP and RESET interrupts for TCG
Implement them like KVM implements/handles them. Both can only be
triggered via SIGP instructions. RESET has (almost) the lowest priority if
the CPU is running, and the highest if the CPU is STOPPED. This is handled
in SIGP code already. On delivery, we only have to care about the
"CPU running" scenario.
STOP is defined to be delivered after all other interrupts have been
delivered. Therefore it has the actual lowest priority.
As both can wake up a CPU if sleeping, indicate them correctly to
external code (e.g. cpu_has_work()).
We want to use the same code base for TCG, so let's cleanly factor it
out.
The sigp mutex is currently not really needed, as everything is
protected by the iothread mutex. But this could change later, so leave
it in place and initialize it properly from common code.
target/s390x: interpret PSW_MASK_WAIT only for TCG
KVM handles the wait PSW itself and triggers a WAIT ICPT in case it
really wants to sleep (disabled wait).
This will later allow us to change the order of loading a restart
interrupt and setting a CPU to OPERATING on SIGP RESTART without
changing KVM behavior.
s390x/tcg: handle WAIT PSWs during interrupt injection
If we encounter a WAIT PSW, we have to halt immediately. Using
cpu_loop_exit() at this point feels wrong. Simply leaving
cs->exception_index set doesn't result in an immediate stop.
This is also necessary to properly handle SIGP STOP interrupts later.
The CPU_INTERRUPT_HALT will be processed immediately and will properly
set the CPU to halted (also resetting cs->exception_index to EXCP_HLT).
s390x/tcg: a CPU cannot switch state due to an interrupt
Going to OPERATING here looks wrong. A CPU should never even be
!OPERATING at this point. Unhalting will already be done in
cpu_handle_halt() if there is work, so we can drop this statement
completely.