I believe that the separate allocation of DisasFields from DisasContext
was meant to limit the places from which we could access fields. But
that plan did not go unchanged, and since DisasContext contains a pointer
to fields, the substructure is accessible everywhere.
By allocating the substructure with DisasContext, we improve the locality
of the accesses by avoiding one level of pointer chasing. In addition,
we avoid a dangling pointer to stack allocated memory, diagnosed by static
checkers.
Thomas Huth [Wed, 22 Jan 2020 10:14:37 +0000 (11:14 +0100)]
target/s390x/kvm: Enable adapter interruption suppression again
The AIS feature has been disabled late in the v2.10 development cycle since
there were some issues with migration (see commit 3f2d07b3b01ea61126b -
"s390x/ais: for 2.10 stable: disable ais facility"). We originally wanted
to enable it again for newer machine types, but apparently we forgot to do
this so far. Let's do it now for the machines that support proper CPU models.
Commit ae71ed8610 replaced the use of global max_cpus variable
with a machine property, but introduced a unnecessary ifdef, as
this block is already in the 'not CONFIG_USER_ONLY' branch part:
Cornelia Huck [Tue, 21 Jan 2020 09:41:00 +0000 (10:41 +0100)]
s390x/event-facility: fix error propagation
We currently check (by error) if the passed-in Error pointer errp
is non-null and return after realizing the first child of the
event facility in that case. Symptom is that 'virsh shutdown'
does not work, as the sclpquiesce device is not realized.
Fix this by (correctly) checking the local Error err.
Cornelia Huck [Thu, 16 Jan 2020 12:10:35 +0000 (13:10 +0100)]
s390x: adapter routes error handling
If the kernel irqchip has been disabled, we don't want the
{add,release}_adapter_routes routines to call any kvm_irqchip_*
interfaces, as they may rely on an irqchip actually having been
created. Just take a quick exit in that case instead. If you are
trying to use irqfd without a kernel irqchip, we will fail with
an error.
Also initialize routes->gsi[] with -1 in the virtio-ccw handling,
to make sure we don't trip over other errors, either. (Nobody
else uses the gsi array in that structure.)
'out' label from write_event_mask() and write_event_data()
can be replaced by 'return'.
The 'out' label from read_event_data() can also be replaced.
However, as suggested by Cornelia Huck, instead of simply
replacing the 'out' label, let's also change the code flow
a bit to make it clearer that sccb events are always handled
regardless of the mask for unconditional reads, while selective
reads are handled if the mask is valid.
s390x/sclp.c: remove unneeded label in sclp_service_call()
'out' label can be replaced by 'return' with the appropriate
value. The 'r' integer, which is used solely to set the
return value for this label, can also be removed.
Peter Maydell [Mon, 27 Jan 2020 09:44:03 +0000 (09:44 +0000)]
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
* Register qdev properties as class properties (Marc-André)
* Cleanups (Philippe)
* virtio-scsi fix (Pan Nengyuan)
* Tweak Skylake-v3 model id (Kashyap)
* x86 UCODE_REV support and nested live migration fix (myself)
* Advisory mode for pvpanic (Zhenwei)
* remotes/bonzini/tags/for-upstream: (58 commits)
build-sys: clean up flags included in the linker command line
target/i386: Add the 'model-id' for Skylake -v3 CPU models
qdev: use object_property_help()
qapi/qmp: add ObjectPropertyInfo.default-value
qom: introduce object_property_help()
qom: simplify qmp_device_list_properties()
vl: print default value in object help
qdev: register properties as class properties
qdev: move instance properties to class properties
qdev: rename DeviceClass.props
qdev: set properties with device_class_set_props()
object: return self in object_ref()
object: release all props
object: add object_class_property_add_link()
object: express const link with link property
object: add direct link flag
object: rename link "child" to "target"
object: check strong flag with &
object: do not free class properties
object: add object_property_set_default
...
Paolo Bonzini [Wed, 11 Dec 2019 14:34:27 +0000 (15:34 +0100)]
build-sys: clean up flags included in the linker command line
Some of the CFLAGS that are discovered during configure, for example
compiler warnings, are being included on the linker command line because
QEMU_CFLAGS is added to it. Other flags, such as the -m32, appear twice
because they are included in both QEMU_CFLAGS and LDFLAGS. All this
leads to confusion with respect to what goes in which Makefile variables
(and we have plenty).
So, introduce QEMU_LDFLAGS for flags discovered by configure, following
the lead of QEMU_CFLAGS, and stop adding to it:
1) options that are already in CFLAGS, for example "-g"
2) duplicate options
At the same time, options that _are_ needed by both compiler and linker
must now be added to both QEMU_CFLAGS and QEMU_LDFLAGS, which is clearer.
This is mostly -fsanitize options. For now, --extra-cflags has this behavior
(but --extra-cxxflags does not).
Meson will not include CFLAGS on the linker command line, do the same in our
build system as well.
target/i386: Add the 'model-id' for Skylake -v3 CPU models
This fixes a confusion in the help output. (Although, if you squint
long enough at the '-cpu help' output, you _do_ notice that
"Skylake-Client-noTSX-IBRS" is an alias of "Skylake-Client-v3";
similarly for Skylake-Server-v3.)
Without this patch:
$ qemu-system-x86 -cpu help
...
x86 Skylake-Client-v1 Intel Core Processor (Skylake)
x86 Skylake-Client-v2 Intel Core Processor (Skylake, IBRS)
x86 Skylake-Client-v3 Intel Core Processor (Skylake, IBRS)
...
x86 Skylake-Server-v1 Intel Xeon Processor (Skylake)
x86 Skylake-Server-v2 Intel Xeon Processor (Skylake, IBRS)
x86 Skylake-Server-v3 Intel Xeon Processor (Skylake, IBRS)
...
With this patch:
$ ./qemu-system-x86 -cpu help
...
x86 Skylake-Client-v1 Intel Core Processor (Skylake)
x86 Skylake-Client-v2 Intel Core Processor (Skylake, IBRS)
x86 Skylake-Client-v3 Intel Core Processor (Skylake, IBRS, no TSX)
...
x86 Skylake-Server-v1 Intel Xeon Processor (Skylake)
x86 Skylake-Server-v2 Intel Xeon Processor (Skylake, IBRS)
x86 Skylake-Server-v3 Intel Xeon Processor (Skylake, IBRS, no TSX)
...
Use class properties facilities to add properties to the class during
device_class_set_props().
qdev_property_add_static() must be adapted as PropertyInfo now
operates with classes (and not instances), so we must
set_default_value() on the ObjectProperty, before calling its init()
method on the object instance.
Also, PropertyInfo.create() is now exclusively used for class
properties. Fortunately, qdev_property_add_static() is only used in
target/arm/cpu.c so far, which doesn't use "link" properties (that
require create()).
Class properties may have to release resources when the object is
destroyed. Let's use the existing release() callback for that, but
class properties must not release ObjectProperty, as it can be shared
by various instances.
The release callback is called during object_property_del_all(), on a
live instance. But class properties are common among all
instances. It is not currently called, because we don't release
classes, but it would not be correct if we did.
Add a default value to ObjectProperty and an implementation of
ObjectPropertyInit that uses it. This will make it easier to show the
default in help messages.
Also provide convenience functions object_property_set_default_{bool,
str, int, uint}().
Commit af0440ae852 moved the qemu_tcg_configure() function,
but introduced extraneous 'include/' in the includes path.
As it is not necessary, remove it.
The accel/ code only accesses the MachineState::accel field.
As we simply want to access the accelerator, not the machine,
add a current_accel() wrapper.
Paolo Bonzini [Mon, 20 Jan 2020 18:21:42 +0000 (19:21 +0100)]
target/i386: kvm: initialize feature MSRs very early
Some read-only MSRs affect the behavior of ioctls such as
KVM_SET_NESTED_STATE. We can initialize them once and for all
right after the CPU is realized, since they will never be modified
by the guest.
hw/core/Makefile: Group generic objects versus system-mode objects
To ease review/modifications of this Makefile, group generic
objects first, then system-mode specific ones, and finally
peripherals (which are only used in system-mode).
Makefile: Clarify all the codebase requires qom/ objects
QEMU user-mode also requires the qom/ objects, it is not only
used by "system emulation and qemu-img". As we will use a big
if() block, move it upper in the "Common libraries for tools
and emulators" section.
We only require libfdt for system emulation, in a small set
of architecture:
4077 # fdt support is mandatory for at least some target architectures,
4078 # so insist on it if we're building those system emulators.
4079 fdt_required=no
4080 for target in $target_list; do
4081 case $target in
4082 aarch64*-softmmu|arm*-softmmu|ppc*-softmmu|microblaze*-softmmu|mips64el-softmmu|riscv*-softmmu)
4083 fdt_required=yes
Do not build libfdt if we did not manually specified --enable-fdt,
or have one of the platforms that require it in our target list.
GCC9 is confused by this comment when building with CFLAG
-Wimplicit-fallthrough=2:
hw/net/imx_fec.c: In function ‘imx_eth_write’:
hw/net/imx_fec.c:906:12: error: this statement may fall through [-Werror=implicit-fallthrough=]
906 | if (unlikely(single_tx_ring)) {
| ^
hw/net/imx_fec.c:912:5: note: here
912 | case ENET_TDAR: /* FALLTHROUGH */
| ^~~~
cc1: all warnings being treated as errors
Rewrite the comments in the correct place, using 'fall through'
which is recognized by GCC and static analyzers.
Reported by GCC9 when building with CFLAG -Wimplicit-fallthrough=2:
hw/timer/aspeed_timer.c: In function ‘aspeed_timer_set_value’:
hw/timer/aspeed_timer.c:283:24: error: this statement may fall through [-Werror=implicit-fallthrough=]
283 | if (old_reload || !t->reload) {
| ~~~~~~~~~~~^~~~~~~~~~~~~
hw/timer/aspeed_timer.c:287:5: note: here
287 | case TIMER_REG_STATUS:
| ^~~~
cc1: all warnings being treated as errors
When building with GCC9 using CFLAG -Wimplicit-fallthrough=2 we get:
hw/display/tcx.c: In function ‘tcx_dac_writel’:
hw/display/tcx.c:453:26: error: this statement may fall through [-Werror=implicit-fallthrough=]
453 | s->dac_index = (s->dac_index + 1) & 0xff; /* Index autoincrement */
| ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
hw/display/tcx.c:454:9: note: here
454 | default:
| ^~~~~~~
hw/display/tcx.c: In function ‘tcx_dac_readl’:
hw/display/tcx.c:412:22: error: this statement may fall through [-Werror=implicit-fallthrough=]
412 | s->dac_index = (s->dac_index + 1) & 0xff; /* Index autoincrement */
| ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
hw/display/tcx.c:413:5: note: here
413 | default:
| ^~~~~~~
cc1: all warnings being treated as errors
Give a hint to GCC by adding the missing fall through comments.
When building with GCC9 using CFLAG -Wimplicit-fallthrough=2 we get:
audio/audio.c: In function ‘audio_pcm_init_info’:
audio/audio.c:306:14: error: this statement may fall through [-Werror=implicit-fallthrough=]
306 | sign = 1;
| ~~~~~^~~
audio/audio.c:307:5: note: here
307 | case AUDIO_FORMAT_U8:
| ^~~~
cc1: all warnings being treated as errors
Similarly to e46349414, add the missing fall through comment to
hint GCC.
qom/object: Display more helpful message when an interface is missing
When adding new devices implementing QOM interfaces, we might
forgot to add the Kconfig dependency that pulls the required
objects in when building.
Since QOM dependencies are resolved at runtime, we don't get any
link-time failures, and QEMU aborts while starting:
$ qemu ...
Segmentation fault (core dumped)
(gdb) bt
#0 0x00007ff6e96b1e35 in raise () from /lib64/libc.so.6
#1 0x00007ff6e969c895 in abort () from /lib64/libc.so.6
#2 0x00005572bc5051cf in type_initialize (ti=0x5572be6f1200) at qom/object.c:323
#3 0x00005572bc505074 in type_initialize (ti=0x5572be6f1800) at qom/object.c:301
#4 0x00005572bc505074 in type_initialize (ti=0x5572be6e48e0) at qom/object.c:301
#5 0x00005572bc506939 in object_class_by_name (typename=0x5572bc56109a) at qom/object.c:959
#6 0x00005572bc503dd5 in cpu_class_by_name (typename=0x5572bc56109a, cpu_model=0x5572be6d9930) at hw/core/cpu.c:286
Since the caller has access to the qdev parent/interface names,
we can simply display them to avoid starting a debugger:
zhenwei pi [Tue, 14 Jan 2020 02:31:01 +0000 (10:31 +0800)]
pvpanic: introduce crashloaded for pvpanic
Add bit 1 for pvpanic. This bit means that guest hits a panic, but
guest wants to handle error by itself. Typical case: Linux guest runs
kdump in panic. It will help us to separate the abnormal reboot from
normal operation.
Peter Maydell [Fri, 24 Jan 2020 12:34:04 +0000 (12:34 +0000)]
Merge remote-tracking branch 'remotes/palmer/tags/riscv-for-master-5.0-sf1' into staging
RISC-V Patches for the 5.0 Soft Freeze, Part 1
This patch set contains a handful of collected fixes that I'd like to target
for the 5.0 soft freeze (I know that's a long way away, I just don't know what
else to call these):
* A fix for a memory leak initializing the sifive_u board.
* Fixes to privilege mode emulation related to interrupts and fstatus.
Notably absent is the H extension implementation. That's pretty much reviewed,
but not quite ready to go yet and I didn't want to hold back these important
fixes. This boots 32-bit and 64-bit Linux (buildroot this time, just for fun)
and passes "make check".
# gpg: Signature made Tue 21 Jan 2020 22:55:28 GMT
# gpg: using RSA key 2B3C3747446843B24A943A7A2E1319F35FBB1889
# gpg: issuer "[email protected]"
# gpg: Good signature from "Palmer Dabbelt <[email protected]>" [unknown]
# gpg: aka "Palmer Dabbelt <[email protected]>" [unknown]
# gpg: aka "Palmer Dabbelt <[email protected]>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 00CE 76D1 8349 60DF CE88 6DF8 EF4C A150 2CCB AB41
# Subkey fingerprint: 2B3C 3747 4468 43B2 4A94 3A7A 2E13 19F3 5FBB 1889
* remotes/palmer/tags/riscv-for-master-5.0-sf1:
target/riscv: update mstatus.SD when FS is set dirty
target/riscv: fsd/fsw doesn't dirty FP state
target/riscv: Fix tb->flags FS status
riscv: Set xPIE to 1 after xRET
riscv/sifive_u: fix a memory leak in soc_realize()
Peter Maydell [Fri, 24 Jan 2020 09:59:11 +0000 (09:59 +0000)]
Merge remote-tracking branch 'remotes/dgilbert-gitlab/tags/pull-virtiofs-20200123b' into staging
virtiofsd first pull v2
Import our virtiofsd.
This pulls in the daemon to drive a file system connected to the
existing qemu virtiofsd device.
It's derived from upstream libfuse with lots of changes (and a lot
trimmed out).
The daemon lives in the newly created qemu/tools/virtiofsd
Signed-off-by: Dr. David Alan Gilbert <[email protected]>
v2
drop the docs while we discuss where they should live
and we need to redo the manpage in anything but texi
# gpg: Signature made Thu 23 Jan 2020 16:45:18 GMT
# gpg: using RSA key 45F5C71B4A0CB7FB977A9FA90516331EBC5BFDE7
# gpg: Good signature from "Dr. David Alan Gilbert (RH2) <[email protected]>" [full]
# Primary key fingerprint: 45F5 C71B 4A0C B7FB 977A 9FA9 0516 331E BC5B FDE7
* remotes/dgilbert-gitlab/tags/pull-virtiofs-20200123b: (108 commits)
virtiofsd: add some options to the help message
virtiofsd: stop all queue threads on exit in virtio_loop()
virtiofsd/passthrough_ll: Pass errno to fuse_reply_err()
virtiofsd: Convert lo_destroy to take the lo->mutex lock itself
virtiofsd: add --thread-pool-size=NUM option
virtiofsd: fix lo_destroy() resource leaks
virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races
virtiofsd: process requests in a thread pool
virtiofsd: use fuse_buf_writev to replace fuse_buf_write for better performance
virtiofsd: add definition of fuse_buf_writev()
virtiofsd: passthrough_ll: Use cache_readdir for directory open
virtiofsd: Fix data corruption with O_APPEND write in writeback mode
virtiofsd: Reset O_DIRECT flag during file open
virtiofsd: convert more fprintf and perror to use fuse log infra
virtiofsd: do not always set FUSE_FLOCK_LOCKS
virtiofsd: introduce inode refcount to prevent use-after-free
virtiofsd: passthrough_ll: fix refcounting on remove/rename
libvhost-user: Fix some memtable remap cases
virtiofsd: rename inode->refcount to inode->nlookup
virtiofsd: prevent races with lo_dirp_put()
...
* remotes/kraxel/tags/ui-20200123-pull-request:
ui/console: Display the 'none' backend in '-display help'
vnc: prioritize ZRLE compression over ZLIB
Revert "vnc: allow fall back to RAW encoding"
Masayoshi Mizuma [Wed, 18 Dec 2019 20:08:31 +0000 (15:08 -0500)]
virtiofsd: add some options to the help message
Add following options to the help message:
- cache
- flock|no_flock
- norace
- posix_lock|no_posix_lock
- readdirplus|no_readdirplus
- timeout
- writeback|no_writeback
- xattr|no_xattr
Signed-off-by: Masayoshi Mizuma <[email protected]>
dgilbert: Split cache, norace, posix_lock, readdirplus off
into our own earlier patches that added the options
Eryu Guan [Tue, 7 Jan 2020 04:15:21 +0000 (12:15 +0800)]
virtiofsd: stop all queue threads on exit in virtio_loop()
On guest graceful shutdown, virtiofsd receives VHOST_USER_GET_VRING_BASE
request from VMM and shuts down virtqueues by calling fv_set_started(),
which joins fv_queue_thread() threads. So when virtio_loop() returns,
there should be no thread is still accessing data in fuse session and/or
virtio dev.
But on abnormal exit, e.g. guest got killed for whatever reason,
vhost-user socket is closed and virtio_loop() breaks out the main loop
and returns to main(). But it's possible fv_queue_worker()s are still
working and accessing fuse session and virtio dev, which results in
crash or use-after-free.
Fix it by stopping fv_queue_thread()s before virtio_loop() returns,
to make sure there's no-one could access fuse session and virtio dev.
Xiao Yang [Thu, 2 Jan 2020 03:53:12 +0000 (11:53 +0800)]
virtiofsd/passthrough_ll: Pass errno to fuse_reply_err()
lo_copy_file_range() passes -errno to fuse_reply_err() and then fuse_reply_err()
changes it to errno again, so that subsequent fuse_send_reply_iov_nofree() catches
the wrong errno.(i.e. reports "fuse: bad error value: ...").
Make fuse_send_reply_iov_nofree() accept the correct -errno by passing errno
directly in lo_copy_file_range().
virtiofsd: Convert lo_destroy to take the lo->mutex lock itself
lo_destroy was relying on some implicit knowledge of the locking;
we can avoid this if we create an unref_inode that doesn't take
the lock and then grab it for the whole of the lo_destroy.
Stefan Hajnoczi [Thu, 1 Aug 2019 16:54:07 +0000 (17:54 +0100)]
virtiofsd: prevent FUSE_INIT/FUSE_DESTROY races
When running with multiple threads it can be tricky to handle
FUSE_INIT/FUSE_DESTROY in parallel with other request types or in
parallel with themselves. Serialize FUSE_INIT and FUSE_DESTROY so that
malicious clients cannot trigger race conditions.
Stefan Hajnoczi [Thu, 1 Aug 2019 16:54:06 +0000 (17:54 +0100)]
virtiofsd: process requests in a thread pool
Introduce a thread pool so that fv_queue_thread() just pops
VuVirtqElements and hands them to the thread pool. For the time being
only one worker thread is allowed since passthrough_ll.c is not
thread-safe yet. Future patches will lift this restriction so that
multiple FUSE requests can be processed in parallel.
The main new concept is struct FVRequest, which contains both
VuVirtqElement and struct fuse_chan. We now have fv_VuDev for a device,
fv_QueueInfo for a virtqueue, and FVRequest for a request. Some of
fv_QueueInfo's fields are moved into FVRequest because they are
per-request. The name FVRequest conforms to QEMU coding style and I
expect the struct fv_* types will be renamed in a future refactoring.
This patch series is not optimal. fbuf reuse is dropped so each request
does malloc(se->bufsize), but there is no clean and cheap way to keep
this with a thread pool. The vq_lock mutex is held for longer than
necessary, especially during the eventfd_write() syscall. Performance
can be improved in the future.
prctl(2) had to be added to the seccomp whitelist because glib invokes
it.
piaojun [Fri, 16 Aug 2019 03:42:21 +0000 (11:42 +0800)]
virtiofsd: use fuse_buf_writev to replace fuse_buf_write for better performance
fuse_buf_writev() only handles the normal write in which src is buffer
and dest is fd. Specially if src buffer represents guest physical
address that can't be mapped by the daemon process, IO must be bounced
back to the VMM to do it by fuse_buf_copy().
piaojun [Fri, 16 Aug 2019 03:41:16 +0000 (11:41 +0800)]
virtiofsd: add definition of fuse_buf_writev()
Define fuse_buf_writev() which use pwritev and writev to improve io
bandwidth. Especially, the src bufs with 0 size should be skipped as
their mems are not *block_size* aligned which will cause writev failed
in direct io mode.
Misono Tomohiro [Mon, 20 Jan 2020 02:53:30 +0000 (11:53 +0900)]
virtiofsd: passthrough_ll: Use cache_readdir for directory open
Since keep_cache(FOPEN_KEEP_CACHE) has no effect for directory as
described in fuse_common.h, use cache_readdir(FOPNE_CACHE_DIR) for
diretory open when cache=always mode.
Misono Tomohiro [Wed, 23 Oct 2019 12:25:23 +0000 (21:25 +0900)]
virtiofsd: Fix data corruption with O_APPEND write in writeback mode
When writeback mode is enabled (-o writeback), O_APPEND handling is
done in kernel. Therefore virtiofsd clears O_APPEND flag when open.
Otherwise O_APPEND flag takes precedence over pwrite() and write
data may corrupt.
Currently clearing O_APPEND flag is done in lo_open(), but we also
need the same operation in lo_create(). So, factor out the flag
update operation in lo_open() to update_open_flags() and call it
in both lo_open() and lo_create().
This fixes the failure of xfstest generic/069 in writeback mode
(which tests O_APPEND write data integrity).
Vivek Goyal [Tue, 20 Aug 2019 18:37:46 +0000 (14:37 -0400)]
virtiofsd: Reset O_DIRECT flag during file open
If an application wants to do direct IO and opens a file with O_DIRECT
in guest, that does not necessarily mean that we need to bypass page
cache on host as well. So reset this flag on host.
If somebody needs to bypass page cache on host as well (and it is safe to
do so), we can add a knob in daemon later to control this behavior.
I check virtio-9p and they do reset O_DIRECT flag.
Stefan Hajnoczi [Wed, 31 Jul 2019 16:10:06 +0000 (17:10 +0100)]
virtiofsd: introduce inode refcount to prevent use-after-free
If thread A is using an inode it must not be deleted by thread B when
processing a FUSE_FORGET request.
The FUSE protocol itself already has a counter called nlookup that is
used in FUSE_FORGET messages. We cannot trust this counter since the
untrusted client can manipulate it via FUSE_FORGET messages.
Introduce a new refcount to keep inodes alive for the required lifespan.
lo_inode_put() must be called to release a reference. FUSE's nlookup
counter holds exactly one reference so that the inode stays alive as
long as the client still wants to remember it.
Note that the lo_inode->is_symlink field is moved to avoid creating a
hole in the struct due to struct field alignment.
If a new setmemtable command comes in once the vhost threads are
running, it will remap the guests address space and the threads
will now be looking in the wrong place.
Fortunately we're running this command under lock, so we can
update the queue mappings so that threads will look in the new-right
place.
Note: This doesn't fix things that the threads might be doing
without a lock (e.g. a readv/writev!) That's for another time.
Stefan Hajnoczi [Wed, 31 Jul 2019 16:10:04 +0000 (17:10 +0100)]
virtiofsd: rename inode->refcount to inode->nlookup
This reference counter plays a specific role in the FUSE protocol. It's
not a generic object reference counter and the FUSE kernel code calls it
"nlookup".
Stefan Hajnoczi [Fri, 26 Jul 2019 09:11:01 +0000 (10:11 +0100)]
virtiofsd: make lo_release() atomic
Hold the lock across both lo_map_get() and lo_map_remove() to prevent
races between two FUSE_RELEASE requests. In this case I don't see a
serious bug but it's safer to do things atomically.
Stefan Hajnoczi [Wed, 17 Jul 2019 15:05:57 +0000 (16:05 +0100)]
virtiofsd: prevent fv_queue_thread() vs virtio_loop() races
We call into libvhost-user from the virtqueue handler thread and the
vhost-user message processing thread without a lock. There is nothing
protecting the virtqueue handler thread if the vhost-user message
processing thread changes the virtqueue or memory table while it is
running.
This patch introduces a read-write lock. Virtqueue handler threads are
readers. The vhost-user message processing thread is a writer. This
will allow concurrency for multiqueue in the future while protecting
against fv_queue_thread() vs virtio_loop() races.
Note that the critical sections could be made smaller but it would be
more invasive and require libvhost-user changes. Let's start simple and
improve performance later, if necessary. Another option would be an
RCU-style approach with lighter-weight primitives.
Vivek Goyal [Mon, 10 Jun 2019 19:22:06 +0000 (15:22 -0400)]
virtiofsd: Support remote posix locks
Doing posix locks with-in guest kernel are not sufficient if a file/dir
is being shared by multiple guests. So we need the notion of daemon doing
the locks which are visible to rest of the guests.
Given posix locks are per process, one can not call posix lock API on host,
otherwise bunch of basic posix locks properties are broken. For example,
If two processes (A and B) in guest open the file and take locks on different
sections of file, if one of the processes closes the fd, it will close
fd on virtiofsd and all posix locks on file will go away. This means if
process A closes the fd, then locks of process B will go away too.
Similar other problems exist too.
This patch set tries to emulate posix locks while using open file
description locks provided on Linux.
Daemon provides two options (-o posix_lock, -o no_posix_lock) to enable
or disable posix locking in daemon. By default it is enabled.
There are few issues though.
- GETLK() returns pid of process holding lock. As we are emulating locks
using OFD, and these locks are not per process and don't return pid
of process, so GETLK() in guest does not reuturn process pid.
- As of now only F_SETLK is supported and not F_SETLKW. We can't block
the thread in virtiofsd for arbitrary long duration as there is only
one thread serving the queue. That means unlock request will not make
it to daemon and F_SETLKW will block infinitely and bring virtio-fs
to a halt. This is a solvable problem though and will require significant
changes in virtiofsd and kernel. Left as a TODO item for now.