Eric Blake [Mon, 8 May 2017 17:13:02 +0000 (12:13 -0500)]
block: Tweak error message related to qemu-img amend
When converting a 1.1 image down to 0.10, qemu-iotests 060 forces
a contrived failure where allocating a cluster used to replace a
zero cluster reads unaligned data. Since it is a zero cluster
rather than a data cluster being converted, changing the error
message to match our earlier change in 'qcow2: Make distinction
between zero cluster types obvious' is worthwhile.
qemu-img: copy *key-secret opts when opening newly created files
The qemu-img dd/convert commands will create an image file and
then try to open it. Historically it has been possible to open
new files without passing any options. With encrypted files
though, the *key-secret options are mandatory, so we need to
provide those options when opening the newly created file.
qemu-img: introduce --target-image-opts for 'convert' command
The '--image-opts' flag indicates whether the source filename
includes options. The target filename has to remain in the
plain filename format though, since it needs to be passed to
bdrv_create(). When using --skip-create though, it would be
possible to use image-opts syntax. This adds --target-image-opts
to indicate that the target filename includes options. Currently
this mandates use of the --skip-create flag too.
The --image-opts flag can only be used to affect the parsing
of the source image. The target image has to be specified in
the traditional style regardless, since it needs to be passed
to the bdrv_create() API which does not support the new style
opts.
qemu-img: add support for --object with 'dd' command
The qemu-img dd command added --image-opts support, but missed
the corresponding --object support. This prevented passing
secrets (eg auth passwords) needed by certain disk images.
Alberto Garcia [Thu, 11 May 2017 15:03:37 +0000 (18:03 +0300)]
qcow2: remove extra local_error variable
Commit d7086422b1c1e75e320519cfe26176db6ec97a37 added a local_err
variable global to the qcow2_amend_options() function, so there's no
need to have this other one.
Kevin Wolf [Mon, 29 May 2017 12:08:32 +0000 (14:08 +0200)]
mirror: Drop permissions on s->target on completion
This fixes an assertion failure that was triggered by qemu-iotests 129
on some CI host, while the same test case didn't seem to fail on other
hosts.
Essentially the problem is that the blk_unref(s->target) in
mirror_exit() doesn't necessarily mean that the BlockBackend goes away
immediately. It is possible that the job completion was triggered nested
in mirror_drain(), which looks like this:
In this case, the write permissions for s->target are retained until
after blk_drain(), which makes removing mirror_top_bs fail for the
active commit case (can't have a writable backing file in the chain
without the filter driver).
Explicitly dropping the permissions first means that the additional
reference doesn't hurt and the job can complete successfully even if
called from the nested blk_drain().
Gerd Hoffmann [Fri, 19 May 2017 12:04:28 +0000 (14:04 +0200)]
ehci: fix frame timer invocation.
ehci registers ehci_frame_timer as both timer and bottom half, which
turned out to be a bad idea as it can be called as bottom half then
while it is running as timer, and it isn't prepared to handle recursive
calls.
Change the timer func to just schedule the bottom half to avoid this.
Ladi Prosek [Mon, 22 May 2017 12:33:25 +0000 (14:33 +0200)]
usb-hub: set PORT_STAT_C_SUSPEND on host-initiated wake-up
PORT_STAT_C_SUSPEND should be set even on host-initiated wake-up,
i.e. on ClearPortFeature(PORT_SUSPEND). Windows is known to not
work properly otherwise.
Side note, since PORT_ENABLE looks similar and might appear to
have the same issue: According to 11.24.2.7.2.2 C_PORT_ENABLE:
"This bit is set when the PORT_ENABLE bit changes from one to
zero as a result of a Port Error condition (see Section 11.8.1).
This bit is not set on any other changes to PORT_ENABLE."
Thomas Huth [Fri, 19 May 2017 07:00:04 +0000 (09:00 +0200)]
usb: Simplify the parameter parsing of the legacy usb serial device
Coverity complains about the current code, so let's get rid of
the now unneeded while loop and simply always emit "unrecognized
serial USB option" for all unsupported options.
Thomas Huth [Fri, 19 May 2017 06:35:16 +0000 (08:35 +0200)]
usb: Deprecate the legacy -usbdevice option
The '-usbdevice' option is considered as deprecated nowadays and
we might want to remove these options in a future version of QEMU.
So mark this options as deprecated in the documenation and print out
a warning if it is used to tell the user what to use instead.
While we're at it, improve also some other minor USB-related spots
in qemu-options.hx that were not up to date anymore.
Gerd Hoffmann [Mon, 15 May 2017 10:45:43 +0000 (12:45 +0200)]
ehci: fix overflow in frame timer code
In case the frame timer doesn't run for a while due to the host being
busy skipped_uframes can become big enough that UFRAME_TIMER_NS *
skipped_uframes overflows. Which in turn throws off all subsequent
ehci frame timer calculations.
Miloš Stojanović [Mon, 15 May 2017 14:59:49 +0000 (16:59 +0200)]
linux-user: add strace support for uinfo structure of rt_sigqueueinfo() and rt_tgsigqueueinfo()
This commit adds support for printing the content of the target_siginfo_t
structure in a similar way to how it is printed by the host strace. The
pointer to this structure is sent as the last argument of the
rt_sigqueueinfo() and rt_tgsigqueueinfo() system calls.
For this purpose, print_siginfo() is used and the get_target_siginfo()
function is implemented in order to get the information obtained from
the pointer into the form that print_siginfo() expects.
The get_target_siginfo() function is based on
host_to_target_siginfo_noswap() in linux-user mode, but here both
arguments are pointers to target_siginfo_t, so instead of converting
the information to siginfo_t it just extracts and copies it to a
target_siginfo_t structure.
Prior to this commit, typical strace output used to look like this:
8307 rt_sigqueueinfo(8307,50,0x00000040007ff6b0) = 0
After this commit, it looks like this:
8307 rt_sigqueueinfo(8307,50,{si_signo=50, si_code=SI_QUEUE, si_pid=8307,
si_uid=1000, si_sigval=17716762128}) = 0
Miloš Stojanović [Mon, 15 May 2017 14:59:48 +0000 (16:59 +0200)]
linux-user: fix inconsistent spaces in print_siginfo() output
This patch improves the consistentcy of the output from print_siginfo()
by removing spaces around the equal sign of si_pid, si_uid, si_timer1,
si_timer2, si_band, si_fd, si_addr, si_status and si_sigval. This way
they match si_signo and ci_code. Host strace was used as a reference
for this chage.
Prior to this commit, typical strace output used to look like this:
Miloš Stojanović [Mon, 15 May 2017 14:59:46 +0000 (16:59 +0200)]
linux-user: add support for rt_tgsigqueueinfo() system call
Add a new system call: rt_tgsigqueueinfo().
This system call is similar to rt_sigqueueinfo(), but instead of
sending the signal and data to the whole thread group with the ID
equal to the argument tgid, it sends it to a single thread within
that thread group. The ID of the thread is specified by the tid
argument.
The implementation is based on the rt_sigqueueinfo() in linux-user
mode, where the tid is added as the second argument and the
previous second and third argument become arguments three and four,
respectively.
Miloš Stojanović [Mon, 15 May 2017 14:59:43 +0000 (16:59 +0200)]
linux-user: fix ssetmask() system call
Fix the ssetmask() system call by removing the invocation of sigorset().
The ssetmask() system call should replace the old signal mask
with the new and return the old mask. It shouldn't combine
the old and the new mask with sigorset(). Fetching the old
mask for sigorset() is also no longer needed.
The problem was detected after running LTP test group syscalls
for the MIPS EL 32 R2 architecture where the test ssetmask01 failed
with exit code 1. The test passes now that the ssetmask() system call
is fixed.
Miloš Stojanović [Mon, 15 May 2017 14:59:42 +0000 (16:59 +0200)]
linux-user: add tkill(), tgkill() and rt_sigqueueinfo() strace
Improve strace support for syscall tkill(), tgkill() and rt_sigqueueinfo()
by implementing print functions that match arguments types of the system
calls and add them to the corresponding starce.list entry.
tkill:
Prior to this commit, typical strace output used to look like this:
4886 tkill(4886,50,0,4832615904,0,-9151031864016699136) = 0
After this commit, it looks like this:
4886 tkill(4886,50) = 0
tgkill:
Prior to this commit, typical strace output used to look like this:
4890 tgkill(4890,4890,50,8,4832630528,4832615904) = 0
After this commit, it looks like this:
4890 tgkill(4890,4890,50) = 0
rt_sigqueueinfo:
Prior to this commit, typical strace output used to look like this:
8307 rt_sigqueueinfo(8307,50,1996483164,0,0,50) = 0
After this commit, it looks like this:
8307 rt_sigqueueinfo(8307,50,0x00000040007ff6b0) = 0
Miloš Stojanović [Mon, 15 May 2017 14:59:41 +0000 (16:59 +0200)]
linux-user: add strace for getuid(), gettid(), getppid(), geteuid()
Improve strace support for syscalls getuid(), gettid(), getppid()
and geteuid(). Since these system calls don't have arguments, "%s()"
is added in the corresponding strace.list entry so that no arguments
are printed.
getuid:
Prior to this commit, typical strace output used to look like this:
4894 getuid(4894,0,0,274886293296,-3689348814741910323,4832615904) = 1000
After this commit, it looks like this:
4894 getuid() = 1000
gettid:
Prior to this commit, typical strace output used to look like this:
8307 gettid(0,0,64,0,4832630528,4832615840) = 8307
After this commit, it looks like this:
8307 gettid() = 8307
getppid:
Prior to this commit, typical strace output used to look like this:
20588 getppid(20588,64,0,4832630528,4832615888,0) = 20625
After this commit, it looks like this:
20588 getppid() = 20625
geteuid:
Prior to this commit, typical strace output used to look like this:
20588 geteuid(64,0,0,4832615888,0,-9151031864016699136) = 1000
After this commit, it looks like this:
20588 geteuid() = 1000
Andreas Schwab [Mon, 20 Mar 2017 11:31:55 +0000 (12:31 +0100)]
linux-user: remove all traces of qemu from /proc/self/cmdline
Instead of post-processing the real contents use the remembered target
argv. That removes all traces of qemu, including command line options,
and handles QEMU_ARGV0.
linux-user: allocate heap memory for execve arguments
Arguments passed to execve(2) call from user program could
be large, allocating stack memory for them via alloca(3) call
would lead to bad behaviour. Use 'g_new0' to allocate memory
for such arguments.
Laurent Vivier [Wed, 1 Mar 2017 09:37:48 +0000 (10:37 +0100)]
linux-user: fix eventfd
When a fd is opened using eventfd(), a read provides
a 64bit counter in the host byte order, and a
write increase the internal counter by the provided
64bit value.
Laurent Vivier [Wed, 1 Mar 2017 09:37:47 +0000 (10:37 +0100)]
linux-user: call fd_trans_target_to_host_data() for write()
As for sendmsg() or sendto(), we must call the target to
host data translator if it is defined. This is needed for
eventfd(): the write() syscall allows to add a value to
the internal counter, and so, it must be byte-swapped to
the host order.
Ladi Prosek [Thu, 25 May 2017 07:07:47 +0000 (09:07 +0200)]
pc: ACPI BIOS: use highest NUMA node for hotplug mem hole SRAT entry
For reasons unknown, Windows won't online all memory, both at command
line and hot-plugged later, unless the hotplug mem hole SRAT entry
specifies a node greater than or equal to the ones where memory is
added.
Using the highest node on the machine makes recent versions of Windows
happy.
With this example command line:
... \
-m 1024,slots=4,maxmem=32G \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-numa node,nodeid=2 \
-numa node,nodeid=3 \
-object memory-backend-ram,size=1G,id=mem-mem1 \
-device pc-dimm,id=dimm-mem1,memdev=mem-mem1,node=1
Windows reports a total of 1G of RAM without this commit and the expected
2G with this commit.
Sjors Gielen [Wed, 24 May 2017 17:51:12 +0000 (17:51 +0000)]
Fix total IP header length in forwarded TCP packets
When forwarding TCP packets, the internal tcpiphdr struct length was wrongly
used inside the IP header. This commit changes the behaviour to what is used
by tcp_output.c, using the correct full IP header + payload length.
Direct leak of 224 byte(s) in 1 object(s) allocated from:
#0 0x7f0f63cdee60 in malloc (/lib64/libasan.so.3+0xc6e60)
#1 0x556f11ff32d7 in tcp_newtcpcb /home/elmarco/src/qemu/slirp/tcp_subr.c:250
#2 0x556f11fdb1d1 in tcp_listen /home/elmarco/src/qemu/slirp/socket.c:688
#3 0x556f11fca9d5 in slirp_add_hostfwd /home/elmarco/src/qemu/slirp/slirp.c:1052
#4 0x556f11f8db41 in slirp_hostfwd /home/elmarco/src/qemu/net/slirp.c:506
#5 0x556f11f8dd83 in hmp_hostfwd_add /home/elmarco/src/qemu/net/slirp.c:535
There might be a better way to fix this, but calling slirp tcp_close()
doesn't work.
Stephen Bates [Tue, 16 May 2017 19:10:59 +0000 (13:10 -0600)]
nvme: Add support for Controller Memory Buffers
Implement NVMe Controller Memory Buffers (CMBs) which were added in
version 1.2 of the NVMe Specification. This patch adds an optional
argument (cmb_size_mb) which indicates the size of the CMB (in
MB). Currently only the Submission Queue Support (SQS) is enabled
which aligns with the current Linux driver for NVMe.
Kevin Wolf [Mon, 15 May 2017 12:36:23 +0000 (14:36 +0200)]
qemu-iotests: Test streaming with missing job ID
This adds a small test for the image streaming error path for failing
block_job_create(), which would have found the null pointer dereference
in commit a170a91f.
Alberto Garcia [Mon, 15 May 2017 09:34:24 +0000 (12:34 +0300)]
stream: fix crash in stream_start() when block_job_create() fails
The code that tries to reopen a BlockDriverState in stream_start()
when the creation of a new block job fails crashes because it attempts
to dereference a pointer that is known to be NULL.
Maxime Coquelin [Tue, 23 May 2017 12:31:19 +0000 (14:31 +0200)]
virtio_net: Bypass backends for MTU feature negotiation
This patch adds a new internal "x-mtu-bypass-backend" property
to bypass backends for MTU feature negotiation.
When this property is set, the MTU feature is negotiated as soon
as supported by the guest and a MTU value is set via the host_mtu
parameter. In case the backend advertises the feature (e.g. DPDK's
vhost-user backend), the feature negotiation is propagated down to
the backend.
When this property is not set, the backend has to support the MTU
feature for its negotiation to succeed.
For compatibility purpose, this property is disabled for machine
types v2.9 and older.
Peter Xu [Fri, 19 May 2017 03:19:47 +0000 (11:19 +0800)]
intel_iommu: support passthrough (PT)
Hardware support for VT-d device passthrough. Although current Linux can
live with iommu=pt even without this, but this is faster than when using
software passthrough.
Peter Xu [Fri, 19 May 2017 03:19:41 +0000 (11:19 +0800)]
memory: remove the last param in memory_region_iommu_replay()
We were always passing in that one as "false" to assume that's an read
operation, and we also assume that IOMMU translation would always have
that read permission. A better permission would be IOMMU_NONE since the
replay is after all not a real read operation, but just a page table
rebuilding process.
Peter Xu [Fri, 19 May 2017 03:19:40 +0000 (11:19 +0800)]
memory: tune last param of iommu_ops.translate()
This patch converts the old "is_write" bool into IOMMUAccessFlags. The
difference is that "is_write" can only express either read/write, but
sometimes what we really want is "none" here (neither read nor write).
Replay is an good example - during replay, we should not check any RW
permission bits since thats not an actual IO at all.
Greg Kurz [Thu, 25 May 2017 08:30:14 +0000 (10:30 +0200)]
9pfs: local: metadata file for the VirtFS root
When using the mapped-file security, credentials are stored in a metadata
directory located in the parent directory. This is okay for all paths with
the notable exception of the root path, since we don't want and probably
can't create a metadata directory above the virtfs directory on the host.
This patch introduces a dedicated metadata file, sitting in the virtfs root
for this purpose. It relies on the fact that the "." name necessarily refers
to the virtfs root.
As for the metadata directory, we don't want the client to see this file.
The current code only cares for readdir() but there are many other places
to fix actually. The filtering logic is hence put in a separate function.
Before:
# ls -ld
drwxr-xr-x. 3 greg greg 4096 May 5 12:49 .
# chown root.root .
chown: changing ownership of '.': Is a directory
# ls -ld
drwxr-xr-x. 3 greg greg 4096 May 5 12:49 .
After:
# ls -ld
drwxr-xr-x. 3 greg greg 4096 May 5 12:49 .
# chown root.root .
# ls -ld
drwxr-xr-x. 3 root root 4096 May 5 12:50 .
and from the host:
ls -al .virtfs_metadata_root
-rwx------. 1 greg greg 26 May 5 12:50 .virtfs_metadata_root
$ cat .virtfs_metadata_root
virtfs.uid=0
virtfs.gid=0
Greg Kurz [Thu, 25 May 2017 08:30:14 +0000 (10:30 +0200)]
9pfs: local: simplify file opening
The logic to open a path currently sits between local_open_nofollow() and
the relative_openat_nofollow() helper, which has no other user.
For the sake of clarity, this patch moves all the code of the helper into
its unique caller. While here we also:
- drop the code to skip leading "/" because the backend isn't supposed to
pass anything but relative paths without consecutive slashes. The assert()
is kept because we really don't want a buggy backend to pass an absolute
path to openat().
- use strchrnul() to get a simpler code. This is ok since virtfs is for
linux+glibc hosts only.
- don't dup() the initial directory and add an assert() to ensure we don't
return the global mountfd to the caller. BTW, this would mean that the
caller passed an empty path, which isn't supposed to happen either.
Greg Kurz [Thu, 25 May 2017 08:30:14 +0000 (10:30 +0200)]
9pfs: local: resolve special directories in paths
When using the mapped-file security mode, the creds of a path /foo/bar
are stored in the /foo/.virtfs_metadata/bar file. This is okay for all
paths unless they end with '.' or '..', because we cannot create the
corresponding file in the metadata directory.
This patch ensures that '.' and '..' are resolved in all paths.
The core code only passes path elements (no '/') to the backend, with
the notable exception of the '/' path, which refers to the virtfs root.
This patch preserves the current behavior of converting it to '.' so
that it can be passed to "*at()" syscalls ('/' would mean the host root).
Greg Kurz [Thu, 25 May 2017 08:30:14 +0000 (10:30 +0200)]
9pfs: check return value of v9fs_co_name_to_path()
These v9fs_co_name_to_path() call sites have always been around. I guess
no care was taken to check the return value because the name_to_path
operation could never fail at the time. This is no longer true: the
handle and synth backends can already fail this operation, and so will the
local backend soon.
Greg Kurz [Thu, 25 May 2017 08:30:14 +0000 (10:30 +0200)]
9pfs: assume utimensat() and futimens() are present
The utimensat() and futimens() syscalls have been around for ages (ie,
glibc 2.6 and linux 2.6.22), and the decision was already taken to
switch to utimensat() anyway when fixing CVE-2016-9602 in 2.9.
Greg Kurz [Thu, 25 May 2017 08:30:13 +0000 (10:30 +0200)]
fsdev: fix virtfs-proxy-helper cwd
Since chroot() doesn't change the current directory, it is indeed a good
practice to chdir() to the target directory and then then chroot(), or
to chroot() to the target directory and then chdir("/").
The current code does neither of them actually. Let's go for the latter.
This doesn't fix any security issue since all of this takes place before
the helper begins to process requests.
Greg Kurz [Thu, 25 May 2017 08:30:13 +0000 (10:30 +0200)]
9pfs: local: fix unlink of alien files in mapped-file mode
When trying to remove a file from a directory, both created in non-mapped
mode, the file remains and EBADF is returned to the guest.
This is a regression introduced by commit "df4938a6651b 9pfs: local:
unlinkat: don't follow symlinks" when fixing CVE-2016-9602. It changed the
way we unlink the metadata file from
ret = remove("$dir/.virtfs_metadata/$name");
if (ret < 0 && errno != ENOENT) {
/* Error out */
}
/* Ignore absence of metadata */
to
fd = openat("$dir/.virtfs_metadata")
unlinkat(fd, "$name")
if (ret < 0 && errno != ENOENT) {
/* Error out */
}
/* Ignore absence of metadata */
If $dir was created in non-mapped mode, openat() fails with ENOENT and
we pass -1 to unlinkat(), which fails in turn with EBADF.
We just need to check the return of openat() and ignore ENOENT, in order
to restore the behaviour we had with remove().
Signed-off-by: Greg Kurz <[email protected]> Reviewed-by: Eric Blake <[email protected]>
[groug: rewrote the comments as suggested by Eric]
hw/ppc/spapr.c: recover pending LMB unplug info in spapr_lmb_release
When a LMB hot unplug starts, the current DRC LMB status is stored at
spapr->pending_dimm_unplugs QTAILQ. This queue isn't migrated, thus
if a migration occurs in the middle of a LMB unplug the
spapr_lmb_release callback will lost track of the LMB unplug progress.
This patch implements a new recover function spapr_recover_pending_dimm_state
that is used inside spapr_lmb_release to recover this DRC LMB release
status that is lost during the migration.
Signed-off-by: Daniel Henrique Barboza <[email protected]>
[dwg: Minor stylistic changes, simplify error handling] Signed-off-by: David Gibson <[email protected]>
hw/ppc: migrating the DRC state of hotplugged devices
In pseries, a firmware abstraction called Dynamic Reconfiguration
Connector (DRC) is used to assign a particular dynamic resource
to the guest and provide an interface to manage configuration/removal
of the resource associated with it. In other words, DRC is the
'plugged state' of a device.
Before this patch, DRC wasn't being migrated. This causes
post-migration problems due to DRC state mismatch between source and
target. The DRC state of a device X in the source might
change, while in the target the DRC state of X is still fresh. When
migrating the guest, X will not have the same hotplugged state as it
did in the source. This means that we can't hot unplug X in the
target after migration is completed because its DRC state is not consistent.
https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1677552 is one
bug that is caused by this DRC state mismatch between source and
target.
To migrate the DRC state, we defined the VMStateDescription struct for
spapr_drc to enable the transmission of spapr_drc state in migration.
Not all the elements in the DRC state are migrated - only those
that can be modified by guest actions or device add/remove
operations:
- 'isolation_state', 'allocation_state' and 'indicator_state'
are involved in the DR state transition diagram from
PAPR+ 2.7, 13.4;
- 'configured', 'signalled', 'awaiting_release' and 'awaiting_allocation'
are needed in attaching and detaching devices;
- 'indicator_state' provides users with hardware state information.
These are the DRC elements that are migrated.
In this patch the DRC state is migrated for PCI, LMB and CPU
connector types. At this moment there is no support to migrate
DRC for the PHB (PCI Host Bridge) type.
In the 'realize' function the DRC is registered using vmstate_register,
similar to what hw/ppc/spapr_iommu.c does in 'spapr_tce_table_realize'.
This approach works because DRCs are bus-less and do not sit
on a BusClass that implements bc->get_dev_path, so as a fallback the
VMSD gets identified via "spapr_drc"/get_index(drc).
hw/ppc: removing drc->detach_cb and drc->detach_cb_opaque
The pointer drc->detach_cb is being used as a way of informing
the detach() function inside spapr_drc.c which cb to execute. This
information can also be retrieved simply by checking drc->type and
choosing the right callback based on it. In this context, detach_cb
is redundant information that must be managed.
After the previous spapr_lmb_release change, no detach_cb_opaques
are being used by any of the three callbacks functions. This is
yet another information that is now unused and, on top of that, can't
be migrated either.
This patch makes the following changes:
- removal of detach_cb_opaque. the 'opaque' argument was removed from
the callbacks and from the detach() function of sPAPRConnectorClass. The
attribute detach_cb_opaque of sPAPRConnector was removed.
- removal of detach_cb from the detach() call. The function pointer
detach_cb of sPAPRConnector was removed. detach() now uses a
switch(drc->type) to execute the apropriate callback. To achieve this,
spapr_core_release, spapr_lmb_release and spapr_phb_remove_pci_device_cb
callbacks were made public to be visible inside detach().
David Gibson [Wed, 24 May 2017 07:01:48 +0000 (17:01 +1000)]
hw/ppc/spapr.c: adding pending_dimm_unplugs to sPAPRMachineState
The LMB DRC release callback, spapr_lmb_release(), uses an opaque
parameter, a sPAPRDIMMState struct that stores the current LMBs that
are allocated to a DIMM (nr_lmbs). After each call to this callback,
the nr_lmbs is decremented by one and, when it reaches zero, the callback
proceeds with the qdev calls to hot unplug the LMB.
Using drc->detach_cb_opaque is problematic because it can't be migrated in
the future DRC migration work. This patch makes the following changes to
eliminate the usage of this opaque callback inside spapr_lmb_release:
- sPAPRDIMMState was moved from spapr.c and added to spapr.h. A new
attribute called 'addr' was added to it. This is used as an unique
identifier to associate a sPAPRDIMMState to a PCDIMM element.
- sPAPRMachineState now hosts a new QTAILQ called 'pending_dimm_unplugs'.
This queue of sPAPRDIMMState elements will store the DIMM state of DIMMs
that are currently going under an unplug process.
- spapr_lmb_release() will now retrieve the nr_lmbs value by getting the
correspondent sPAPRDIMMState. A helper function called spapr_dimm_get_address
was created to fetch the address of a PCDIMM device inside spapr_lmb_release.
When nr_lmbs reaches zero and the callback proceeds with the qdev hot unplug
calls, the sPAPRDIMMState struct is removed from spapr->pending_dimm_unplugs.
After these changes, the opaque argument for spapr_lmb_release is now
unused and is passed as NULL inside spapr_del_lmbs. This and the other
opaque arguments can now be safely removed from the code.
As an additional cleanup made by this patch, the spapr_del_lmbs function
was merged with spapr_memory_unplug_request. The former was being called
only by the latter and both were small enough to fit one single function.
Signed-off-by: Daniel Henrique Barboza <[email protected]>
[dwg: Minor stylistic cleanups] Signed-off-by: David Gibson <[email protected]>
Jeff Cody [Tue, 23 May 2017 17:27:50 +0000 (13:27 -0400)]
block/gluster: glfs_lseek() workaround
On current released versions of glusterfs, glfs_lseek() will sometimes
return invalid values for SEEK_DATA or SEEK_HOLE. For SEEK_DATA and
SEEK_HOLE, the returned value should be >= the passed offset, or < 0 in
the case of error:
LSEEK(2):
off_t lseek(int fd, off_t offset, int whence);
[...]
SEEK_HOLE
Adjust the file offset to the next hole in the file greater
than or equal to offset. If offset points into the middle of
a hole, then the file offset is set to offset. If there is no
hole past offset, then the file offset is adjusted to the end
of the file (i.e., there is an implicit hole at the end of
any file).
[...]
RETURN VALUE
Upon successful completion, lseek() returns the resulting
offset location as measured in bytes from the beginning of the
file. On error, the value (off_t) -1 is returned and errno is
set to indicate the error
However, occasionally glfs_lseek() for SEEK_HOLE/DATA will return a
value less than the passed offset, yet greater than zero.
For instance, here are example values observed from this call:
offs = glfs_lseek(s->fd, start, SEEK_HOLE);
if (offs < 0) {
return -errno; /* D1 and (H3 or H4) */
}
This causes QEMU to abort on the assert test. When this value is
returned, errno is also 0.
This is a reported and known bug to glusterfs:
https://bugzilla.redhat.com/show_bug.cgi?id=1425293
Although this is being fixed in gluster, we still should work around it
in QEMU, given that multiple released versions of gluster behave this
way.
This patch treats the return case of (offs < start) the same as if an
error value other than ENXIO is returned; we will assume we learned
nothing, and there are no holes in the file.
Paolo Bonzini [Mon, 8 May 2017 14:13:10 +0000 (16:13 +0200)]
blockjob: use deferred_to_main_loop to indicate the coroutine has ended
All block jobs are using block_job_defer_to_main_loop as the final
step just before the coroutine terminates. At this point,
block_job_enter should do nothing, but currently it restarts
the freed coroutine.
Now, the job->co states should probably be changed to an enum
(e.g. BEFORE_START, STARTED, YIELDED, COMPLETED) subsuming
block_job_started, job->deferred_to_main_loop and job->busy.
For now, this patch eliminates the problematic reenter by
removing the reset of job->deferred_to_main_loop (which served
no purpose, as far as I could see) and checking the flag in
block_job_enter.
This splits the part that touches job states from the part that invokes
callbacks. It will make the code simpler to understand once job states will
be protected by a different mutex than the AioContext lock.
Paolo Bonzini [Mon, 8 May 2017 14:13:08 +0000 (16:13 +0200)]
blockjob: strengthen a bit test-blockjob-txn
Unlike test-blockjob-txn, QMP releases the reference to the transaction
before the jobs finish. Thus, qemu-iotest 124 showed a failure while
working on the next patch that the unit tests did not have. Make
the test a little nastier.
The new functions helps respecting the invariant that the coroutine
is entered with false user_resume, zero pause count and no error
recorded in the iostatus.
Resetting the iostatus is now common to all of block_job_cancel_async,
block_job_user_resume and block_job_iostatus_reset, albeit with slight
differences:
- block_job_cancel_async resets the iostatus, and resumes the job if
there was an error, but the coroutine is not restarted immediately.
For example the caller may continue with a call to block_job_finish_sync.
- block_job_user_resume resets the iostatus. It wants to resume the job
unconditionally, even if there was no error.
- block_job_iostatus_reset doesn't resume the job at all. Maybe that's
a bug but it should be fixed separately.
block_job_iostatus_reset does the least common denominator, so add some
checking but otherwise leave it as the entry point for resetting the
iostatus.
Outside blockjob.c, the block_job_iostatus_reset function is used once
in the monitor and once in BlockBackend. When we introduce the block
job mutex, block_job_iostatus_reset's client is going to be the block
layer (for which blockjob.c will take the block job mutex) rather than
the monitor (which will take the block job mutex by itself).
The monitor's call to block_job_iostatus_reset from the monitor comes
just before the sole call to block_job_user_resume, so reset the
iostatus directly from block_job_iostatus_reset. This will avoid
the need to introduce separate block_job_iostatus_reset and
block_job_iostatus_reset_locked APIs.
After making this change, move the function together with the others
that were moved in the previous patch.
Paolo Bonzini [Mon, 8 May 2017 14:13:04 +0000 (16:13 +0200)]
blockjob: separate monitor and blockjob APIs
We have two different headers for block job operations, blockjob.h
and blockjob_int.h. The former contains APIs called by the monitor,
the latter contains APIs called by the block job drivers and the
block layer itself.
Keep the two APIs separate in the blockjob.c file too. This will
be useful when transitioning away from the AioContext lock, because
there will be locking policies for the two categories, too---the
monitor will have to call new block_job_lock/unlock APIs, while blockjob
APIs will take care of this for the users.
Paolo Bonzini [Mon, 8 May 2017 14:13:03 +0000 (16:13 +0200)]
blockjob: introduce block_job_pause/resume_all
Remove use of block_job_pause/resume from outside blockjob.c, thus
making them static. The new functions are used by the block layer,
so place them in blockjob_int.h.
Paolo Bonzini [Mon, 8 May 2017 14:13:02 +0000 (16:13 +0200)]
blockjob: introduce block_job_early_fail
Outside blockjob.c, block_job_unref is only used when a block job fails
to start, and block_job_ref is not used at all. The reference counting
thus is pretty well hidden. Introduce a separate function to be used
by block jobs; because block_job_ref and block_job_unref now become
static, move them earlier in blockjob.c.
* cohuck/tags/s390x-20170523: (21 commits)
s390/kvm: do not reset riccb on initial cpu reset
MAINTAINERS: Add vfio-ccw maintainer
vfio/ccw: update sense data if a unit check is pending
s390x/css: ccw translation infrastructure
s390x/css: introduce and realize ccw-request callback
vfio/ccw: get irqs info and set the eventfd fd
vfio/ccw: get io region info
vfio/ccw: vfio based subchannel passthrough driver
s390x/css: device support for s390-ccw passthrough
s390x/css: realize css_create_sch
s390x/css: realize css_sch_build_schib
s390x/css: add s390-squash-mcss machine option
linux-headers: update
pc-bios/s390-ccw.img: rebuild image
pc-bios/s390-ccw: Build a reasonable max_sectors limit
pc-bios/s390-ccw: Get Block Limits VPD device data
pc-bios/s390-ccw: Get list of supported VPD pages
pc-bios/s390-ccw: Refactor scsi_inquiry function
pc-bios/s390-ccw: Break up virtio-scsi read into multiples
pc-bios/s390-ccw: Move SCSI block factor to outer read
...
David Gibson [Tue, 23 May 2017 06:33:06 +0000 (16:33 +1000)]
pseries: Restore support for total vcpus not a multiple of threads-per-core for old machine types
As of pseries-2.7 and later, we require the total number of guest vcpus to
be a multiple of the threads-per-core. pseries-2.6 and earlier machine
types, however, are supposed to allow this for the sake of migration from
old qemu versions which allowed this.
Unfortunately, 8149e29 "pseries: Enforce homogeneous threads-per-core"
broke this by not considering the old machine type case. This fixes it by
only applying the check when the machine type supports hotpluggable cpus.
By not-entirely-coincidence, that corresponds to the same time when we
started enforcing total threads being a multiple of threads-per-core.
David Gibson [Thu, 18 May 2017 04:47:44 +0000 (14:47 +1000)]
pseries: Split CAS PVR negotiation out into a separate function
Guests of the qemu machine type go through a feature negotiation process
known as "client architecture support" (CAS) during early boot. This does
a number of things, one of which is finding a CPU compatibility mode which
can be supported by both guest and host.
In fact the CPU negotiation is probably the single most complex part of the
CAS process, so this splits it out into a helper function. We've recently
made some mistakes in maintaining backward compatibility for old machine
types here. Splitting this out will also make it easier to fix this.
This also adds a possibly useful error message if the negotiation fails
(i.e. if there isn't a CPU mode that's suitable for both guest and host).
Greg Kurz [Fri, 19 May 2017 10:32:12 +0000 (12:32 +0200)]
spapr: fix error reporting in xics_system_init()
If the user explicitely asked for kernel-irqchip support and "xics-kvm"
initialization fails, we shouldn't fallback to emulated "xics" as we
do now. It is also awkward to print an error message when we have an
errp pointer argument.
Let's use the errp argument to report the error and let the caller decide.
This simplifies the code as we don't need a local Error * here.
Greg Kurz [Fri, 19 May 2017 10:32:04 +0000 (12:32 +0200)]
spapr_cpu_core: drop reference on ICP object during CPU realization
When a piece of code allocates an object, it implicitely gets a reference
on it. If it then makes that object a child property of another object, it
should drop its own reference at some point otherwise the child object can
never be finalized. The current code hence leaks one ICP object per CPU
when hot-removing a core.
Failing to add a newly allocated ICP object to the CPU is a bug. While here,
let's ensure QEMU aborts if this ever happens.
hw/ppc/spapr_events.c: removing 'exception' from sPAPREventLogEntry
Currenty we do not have any RTAS event that is reported by the
event-scan interface. The existing events, RTAS_LOG_TYPE_EPOW and
RTAS_LOG_TYPE_HOTPLUG, are being reported by the check-exception
interface and, as such, marked as 'exception=true'.
Commit 79853e18d9, 'spapr_events: event-scan RTAS interface', added
the event_scan interface because the guest kernel requires it to
initialize other required interfaces. It is acting since then as
a stub because no events that would be reported by it were added
since then. However, the existence of the 'exception' boolean adds
an unnecessary load in the future migration of the pending_events,
sPAPREventLogEntry QTAILQ that hosts the pending RTAS events.
To make the code cleaner and ease the future migration changes, this
patch makes the following changes:
- remove the 'exception' boolean that filter these events. There is
nothing to filter since all events are reported by check-exception;
- functions rtas_event_log_queue, rtas_event_log_dequeue and
rtas_event_log_contains don't receive the 'exception' boolean
as parameter;
- event_scan function was simplified. It was calling
'rtas_event_log_dequeue(mask, false)' that was always returning
'NULL' because we have no events that are created with
exception=false, thus in the end it would execute a jump to
'out_no_events' all the time. The function now assumes that
this will always be the case and all the remaining logic were
deleted.
In the future, when or if we add new RTAS events that should
be reported with the event_scan interface, we can refer to
the changes made in this patch to add the event_scan logic
back.
Greg Kurz [Wed, 17 May 2017 14:38:20 +0000 (16:38 +0200)]
xics_kvm: cache already enabled vCPU ids
Since commit a45863bda90d ("xics_kvm: Don't enable KVM_CAP_IRQ_XICS if
already enabled"), we were able to re-hotplug a vCPU that had been hot-
unplugged ealier, thanks to a boolean flag in ICPState that we set when
enabling KVM_CAP_IRQ_XICS.
This could work because the lifecycle of all ICPState objects was the
same as the machine. Commit 5bc8d26de20c ("spapr: allocate the ICPState
object from under sPAPRCPUCore") broke this assumption and now we always
pass a freshly allocated ICPState object (ie, with the flag unset) to
icp_kvm_cpu_setup().
This cause re-hotplug to fail with:
Unable to connect CPU8 to kernel XICS: Device or resource busy
Let's fix this by caching all the vCPU ids for which KVM_CAP_IRQ_XICS was
enabled. This also drops the now useless boolean flag from ICPState.
Greg Kurz [Mon, 15 May 2017 11:39:45 +0000 (13:39 +0200)]
spapr: sanitize error handling in spapr_ics_create()
The spapr_ics_create() function handles errors in a rather convoluted
way, with two local Error * variables. Moreover, failing to parent the
ICS object to the machine should be considered as a bug but it is
currently ignored.
Stefan Hajnoczi [Tue, 23 May 2017 13:53:41 +0000 (14:53 +0100)]
Merge remote-tracking branch 'jasowang/tags/net-pull-request' into staging
# gpg: Signature made Tue 23 May 2017 03:27:37 AM BST
# gpg: using RSA key 0xEF04965B398D6211
# gpg: Good signature from "Jason Wang (Jason Wang on RedHat) <[email protected]>"
# Primary key fingerprint: 215D 46F4 8246 689E C77F 3562 EF04 965B 398D 6211
* jasowang/tags/net-pull-request:
e1000e: Fix ICR "Other" causes clear logic
net/filter-rewriter: Remove unused option in filter-rewriter
net/filter-mirror.c: Rename filter_mirror_send() and fix codestyle
net/filter-mirror.c: Remove duplicate check code.
hmp / net: Mark host_net_add/remove as deprecated
COLO-compare: Improve tcp compare trace event readability
virtio-net: fix wild pointer when remove virtio-net queues
net/dump: Issue a warning for the deprecated "-net dump"
net/tap: Replace tap-haiku.c and tap-aix.c by a generic tap-stub.c
Eduardo Habkost [Tue, 16 May 2017 20:53:51 +0000 (17:53 -0300)]
qapi-schema: Remove obsolete note from ObjectTypeInfo
The "This command is experimental" note in ObjectTypeInfo is obsolete
since 2012. Commit 5192082097549c5b3aa7c913c6853d97a68172cb removed the
warning from the qom-list-types command documentation, but we forgot to
remove the warning from ObjectTypeInfo.
Eric Blake [Mon, 15 May 2017 19:54:39 +0000 (14:54 -0500)]
block: Use QDict helpers for --force-share
Fam's addition of --force-share in commits 459571f7 and 335e9937
were developed prior to the addition of QDict scalar insertion
macros, but merged after the general cleanup in commit 46f5ac20.
Patch created mechanically by rerunning:
Eric Blake [Mon, 15 May 2017 21:41:14 +0000 (16:41 -0500)]
shutdown: Expose bool cause in SHUTDOWN and RESET events
Libvirt would like to be able to distinguish between a SHUTDOWN
event triggered solely by guest request and one triggered by a
SIGTERM or other action on the host. While qemu_kill_report() was
already able to give different output to stderr based on whether a
shutdown was triggered by a host signal (but NOT by a host UI event,
such as clicking the X on the window), that information was then
lost to management. The previous patches improved things to use an
enum throughout all callsites, so now we have something ready to
expose through QMP.
Note that for now, the decision was to expose ONLY a boolean,
rather than promoting ShutdownCause to a QAPI enum; this is because
libvirt has not expressed an interest in anything finer-grained.
We can still add additional details, in a backwards-compatible
manner, if a need later arises (if the addition happens before 2.10,
we can replace the bool with an enum; otherwise, the enum will have
to be in addition to the bool); this patch merely adds a helper
shutdown_caused_by_guest() to map the internal enum into the
external boolean.
Update expected iotest outputs to match the new data (complete
coverage of the affected tests is obtained by -raw, -qcow2, and -nbd).
Here is output from 'virsh qemu-monitor-event --loop' with the
patch installed:
event SHUTDOWN at 1492639680.731251 for domain fedora_13: {"guest":true}
event STOP at 1492639680.732116 for domain fedora_13: <null>
event SHUTDOWN at 1492639680.732830 for domain fedora_13: {"guest":false}
Note that libvirt runs qemu with -no-shutdown: the first SHUTDOWN event
was triggered by an action I took directly in the guest (shutdown -h),
at which point qemu stops the vcpus and waits for libvirt to do any
final cleanups; the second SHUTDOWN event is the result of libvirt
sending SIGTERM now that it has completed cleanup. Libvirt is already
smart enough to only feed the first qemu SHUTDOWN event to the end user
(remember, virsh qemu-monitor-event is a low-level debugging interface
that is explicitly unsupported by libvirt, so it sees things that normal
end users do not); changing qemu to emit SHUTDOWN only once is outside
the scope of this series.
Eric Blake [Mon, 15 May 2017 21:41:13 +0000 (16:41 -0500)]
shutdown: Add source information to SHUTDOWN and RESET
Time to wire up all the call sites that request a shutdown or
reset to use the enum added in the previous patch.
It would have been less churn to keep the common case with no
arguments as meaning guest-triggered, and only modified the
host-triggered code paths, via a wrapper function, but then we'd
still have to audit that I didn't miss any host-triggered spots;
changing the signature forces us to double-check that I correctly
categorized all callers.
Since command line options can change whether a guest reset request
causes an actual reset vs. a shutdown, it's easy to also add the
information to reset requests.