Instead of just relying on the comment "Called only on full-dirty
region" in block_copy_task_create() let's move initial dirty area
search directly to block_copy_task_create(). Let's also use effective
bdrv_dirty_bitmap_next_dirty_area instead of looping through all
non-dirty clusters.
We are going to use aio-task-pool API, so tasks will be handled in
parallel. We need therefore separate allocated task on each iteration.
Introduce this logic now.
Maxim Levitsky [Mon, 4 May 2020 13:19:59 +0000 (16:19 +0300)]
Fix iotest 153
Commit f62514b3def5fb2acbef64d0e053c0c31fa45aff made qemu-img reject -o "" but this test uses it.
Since this test only tries to do a dry-run run of qemu-img amend,
replace the -o "" with dummy -o "size=$size".
Eric Blake [Tue, 28 Apr 2020 19:26:48 +0000 (14:26 -0500)]
qcow2: Tweak comment about bitmaps vs. resize
Our comment did not actually match the code. Rewrite the comment to
be less sensitive to any future changes to qcow2-bitmap.c that might
implement scenarios that we currently reject.
Eric Blake [Tue, 28 Apr 2020 19:26:47 +0000 (14:26 -0500)]
qcow2: Allow resize of images with internal snapshots
We originally refused to allow resize of images with internal
snapshots because the v2 image format did not require the tracking of
snapshot size, making it impossible to safely revert to a snapshot
with a different size than the current view of the image. But the
snapshot size tracking was rectified in v3, and our recent fixes to
qemu-img amend (see 0a85af35) guarantee that we always have a valid
snapshot size. Thus, we no longer need to artificially limit image
resizes, but it does become one more thing that would prevent a
downgrade back to v2. And now that we support different-sized
snapshots, it's also easy to fix reverting to a snapshot to apply the
new size.
Upgrade iotest 61 to cover this (we previously had NO coverage of
refusal to resize while snapshots exist). Note that the amend process
can fail but still have effects: in particular, since we break things
into upgrade, resize, downgrade, a failure during resize does not roll
back changes made during upgrade, nor does failure in downgrade roll
back a resize. But this situation is pre-existing even without this
patch; and without journaling, the best we could do is minimize the
chance of partial failure by collecting all changes prior to doing any
writes - which adds a lot of complexity but could still fail with EIO.
On the other hand, we are careful that even if we have partial
modification but then fail, the image is left viable (that is, we are
careful to sequence things so that after each successful cluster
write, there may be transient leaked clusters but no corrupt
metadata). And complicating the code to make it more transaction-like
is not worth the effort: a user can always request multiple 'qemu-img
amend' changing one thing each, if they need finer-grained control
over detecting the first failure than what they get by letting qemu
decide how to sequence multiple changes.
John Snow [Tue, 31 Mar 2020 00:00:14 +0000 (20:00 -0400)]
iotests: use python logging for iotests.log()
We can turn logging on/off globally instead of per-function.
Remove use_log from run_job, and use python logging to turn on
diffable output when we run through a script entry point.
iotest 245 changes output order due to buffering reasons.
An extended note on python logging:
A NullHandler is added to `qemu.iotests` to stop output from being
generated if this code is used as a library without configuring logging.
A NullHandler is only needed at the root, so a duplicate handler is not
needed for `qemu.iotests.diff_io`.
When logging is not configured, messages at the 'WARNING' levels or
above are printed with default settings. The NullHandler stops this from
occurring, which is considered good hygiene for code used as a library.
See https://docs.python.org/3/howto/logging.html#library-config
When logging is actually enabled (always at the behest of an explicit
call by a client script), a root logger is implicitly created at the
root, which allows messages to propagate upwards and be handled/emitted
from the root logger with default settings.
When we want iotest logging, we attach a handler to the
qemu.iotests.diff_io logger and disable propagation to avoid possible
double-printing.
For more information on python logging infrastructure, I highly
recommend downloading the pip package `logging_tree`, which provides
convenient visualizations of the hierarchical logging configuration
under different circumstances.
See https://pypi.org/project/logging_tree/ for more information.
John Snow [Tue, 31 Mar 2020 00:00:10 +0000 (20:00 -0400)]
iotests: add hmp helper with logging
Minor cleanup for HMP functions; helps with line length and consolidates
HMP helpers through one implementation function.
Although we are adding a universal toggle to turn QMP logging on or off,
many existing callers to hmp functions don't expect that output to be
logged, which causes quite a few changes in the test output.
For now, offer a use_log parameter.
Typing notes:
QMPResponse is just an alias for Dict[str, Any]. It holds no special
meanings and it is not a formal subtype of Dict[str, Any]. It is best
thought of as a lexical synonym.
We may well wish to add stricter subtypes in the future for certain
shapes of data that are not formalized as Python objects, at which point
we can simply retire the alias and allow mypy to more strictly check
usages of the name.
John Snow [Tue, 31 Mar 2020 00:00:08 +0000 (20:00 -0400)]
iotests: touch up log function signature
Representing nested, recursive data structures in mypy is notoriously
difficult; the best we can reliably do right now is denote the leaf
types as "Any" while describing the general shape of the data.
Regardless, this fully annotates the log() function.
Typing notes:
TypeVar is a Type variable that can optionally be constrained by a
sequence of possible types. This variable is bound to a specific type
per-invocation, like a Generic.
log() behaves as log<Msg>() now, where the incoming type informs the
signature it expects for any filter arguments passed in. If Msg is a
str, then filter should take and return a str.
John Snow [Tue, 31 Mar 2020 00:00:03 +0000 (20:00 -0400)]
iotests: ignore import warnings from pylint
The right way to solve this is to come up with a virtual environment
infrastructure that sets all the paths correctly, and/or to create
installable python modules that can be imported normally.
John Snow [Tue, 31 Mar 2020 00:00:01 +0000 (20:00 -0400)]
iotests: do a light delinting
This doesn't fix everything in here, but it does help clean up the
pylint report considerably.
This should be 100% style changes only; the intent is to make pylint
more useful by working on establishing a baseline for iotests that we
can gate against in the future.
Liran Alon [Fri, 13 Mar 2020 14:50:08 +0000 (16:50 +0200)]
acpi: Add Windows ACPI Emulated Device Table (WAET)
Microsoft introduced this ACPI table to avoid Windows guests performing
various workarounds for device erratas. As the virtual device emulated
by VMM may not have the errata.
Currently, WAET allows hypervisor to inform guest about two
specific behaviors: One for RTC and the other for ACPI PM timer.
Support for WAET have been introduced since Windows Vista. This ACPI
table is also exposed by other common hypervisors by default, including:
VMware, GCP and AWS.
This patch adds WAET ACPI Table to QEMU.
We set "ACPI PM timer good" bit in "Emualted Device Flags" field to
indicate that the ACPI PM timer has been enhanced to not require
multiple reads to obtain a reliable value.
This results in improving the performance of Windows guests that use
ACPI PM timer by avoiding unnecessary VMExits caused by these multiple
reads.
Raphael Norwitz [Wed, 25 Mar 2020 10:35:06 +0000 (06:35 -0400)]
Refactor vhost_user_set_mem_table functions
vhost_user_set_mem_table() and vhost_user_set_mem_table_postcopy() have
gotten convoluted, and have some identical code.
This change moves the logic populating the VhostUserMemory struct and
fds array from vhost_user_set_mem_table() and
vhost_user_set_mem_table_postcopy() to a new function,
vhost_user_fill_set_mem_table_msg().
tests: Update ACPI tables list for upcoming arm/virt test changes
This is in preparation to update test_acpi_virt_tcg_memhp()
with pc-dimm and nvdimm. Update the bios-tables-test-allowed-diff.h
with the affected ACPI tables so that "make check" doesn't fail.
Also add empty files for new tables required for new test.
This adds support for nvdimm hotplug events through GED
and enables nvdimm for the arm/virt. Now Guests with ACPI
can have both cold and hot plug of nvdimms.
Kwangwoo Lee [Tue, 21 Apr 2020 12:59:29 +0000 (13:59 +0100)]
nvdimm: Use configurable ACPI IO base and size
This patch makes IO base and size configurable to create NPIO AML for
ACPI NFIT. Since a different architecture like AArch64 does not use
port-mapped IO, a configurable IO base is required to create correct
mapping of ACPI IO address and size.
hw/acpi/nvdimm: Fix for NVDIMM incorrect DSM output buffer length
As per ACPI spec 6.3, Table 19-419 Object Conversion Rules, if
the Buffer Field <= to the size of an Integer (in bits), it will
be treated as an integer. Moreover, the integer size depends on
DSDT tables revision number. If revision number is < 2, integer
size is 32 bits, otherwise it is 64 bits. Current NVDIMM common
DSM aml code (NCAL) uses CreateField() for creating DSM output
buffer. This creates an issue in arm/virt platform where DSDT
revision number is 2 and results in DSM buffer with a wrong
size(8 bytes) gets returned when actual length is < 8 bytes.
This causes guest kernel to report,
"nfit ACPI0012:00: found a zero length table '0' parsing nfit"
In order to fix this, aml code is now modified such that it builds
the DSM output buffer in a byte by byte fashion when length is
smaller than Integer size.
checkpatch: fix acpi check with multiple file name
Using global expected/nonexpected values causes
false positives when testing multiple patches in one
checkpatch run: one patch can change expected,
another one non-expected.
Li Feng [Fri, 17 Apr 2020 10:17:07 +0000 (18:17 +0800)]
vhost-user-blk: fix invalid memory access
when s->inflight is freed, vhost_dev_free_inflight may try to access
s->inflight->addr, it will retrigger the following issue.
==7309==ERROR: AddressSanitizer: heap-use-after-free on address 0x604001020d18 at pc 0x555555ce948a bp 0x7fffffffb170 sp 0x7fffffffb160
READ of size 8 at 0x604001020d18 thread T0
#0 0x555555ce9489 in vhost_dev_free_inflight /root/smartx/qemu-el7/qemu-test/hw/virtio/vhost.c:1473
#1 0x555555cd86eb in virtio_reset /root/smartx/qemu-el7/qemu-test/hw/virtio/virtio.c:1214
#2 0x5555560d3eff in virtio_pci_reset hw/virtio/virtio-pci.c:1859
#3 0x555555f2ac53 in device_set_realized hw/core/qdev.c:893
#4 0x5555561d572c in property_set_bool qom/object.c:1925
#5 0x5555561de8de in object_property_set_qobject qom/qom-qobject.c:27
#6 0x5555561d99f4 in object_property_set_bool qom/object.c:1188
#7 0x555555e50ae7 in qdev_device_add /root/smartx/qemu-el7/qemu-test/qdev-monitor.c:626
#8 0x555555e51213 in qmp_device_add /root/smartx/qemu-el7/qemu-test/qdev-monitor.c:806
#9 0x555555e8ff40 in hmp_device_add /root/smartx/qemu-el7/qemu-test/hmp.c:1951
#10 0x555555be889a in handle_hmp_command /root/smartx/qemu-el7/qemu-test/monitor.c:3404
#11 0x555555beac8b in monitor_command_cb /root/smartx/qemu-el7/qemu-test/monitor.c:4296
#12 0x555556433eb7 in readline_handle_byte util/readline.c:393
#13 0x555555be89ec in monitor_read /root/smartx/qemu-el7/qemu-test/monitor.c:4279
#14 0x5555563285cc in tcp_chr_read chardev/char-socket.c:470
#15 0x7ffff670b968 in g_main_context_dispatch (/lib64/libglib-2.0.so.0+0x4a968)
#16 0x55555640727c in glib_pollfds_poll util/main-loop.c:215
#17 0x55555640727c in os_host_main_loop_wait util/main-loop.c:238
#18 0x55555640727c in main_loop_wait util/main-loop.c:497
#19 0x555555b2d0bf in main_loop /root/smartx/qemu-el7/qemu-test/vl.c:2013
#20 0x555555b2d0bf in main /root/smartx/qemu-el7/qemu-test/vl.c:4776
#21 0x7fffdd2eb444 in __libc_start_main (/lib64/libc.so.6+0x22444)
#22 0x555555b3767a (/root/smartx/qemu-el7/qemu-test/x86_64-softmmu/qemu-system-x86_64+0x5e367a)
0x604001020d18 is located 8 bytes inside of 40-byte region [0x604001020d10,0x604001020d38)
freed by thread T0 here:
#0 0x7ffff6f00508 in __interceptor_free (/lib64/libasan.so.4+0xde508)
#1 0x7ffff671107d in g_free (/lib64/libglib-2.0.so.0+0x5007d)
previously allocated by thread T0 here:
#0 0x7ffff6f00a88 in __interceptor_calloc (/lib64/libasan.so.4+0xdea88)
#1 0x7ffff6710fc5 in g_malloc0 (/lib64/libglib-2.0.so.0+0x4ffc5)
SUMMARY: AddressSanitizer: heap-use-after-free /root/smartx/qemu-el7/qemu-test/hw/virtio/vhost.c:1473 in vhost_dev_free_inflight
Shadow bytes around the buggy address:
0x0c08801fc150: fa fa 00 00 00 00 04 fa fa fa fd fd fd fd fd fa
0x0c08801fc160: fa fa fd fd fd fd fd fd fa fa 00 00 00 00 04 fa
0x0c08801fc170: fa fa 00 00 00 00 00 01 fa fa 00 00 00 00 04 fa
0x0c08801fc180: fa fa 00 00 00 00 00 01 fa fa 00 00 00 00 00 01
0x0c08801fc190: fa fa 00 00 00 00 00 fa fa fa 00 00 00 00 04 fa
=>0x0c08801fc1a0: fa fa fd[fd]fd fd fd fa fa fa fd fd fd fd fd fa
0x0c08801fc1b0: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fa
0x0c08801fc1c0: fa fa 00 00 00 00 00 fa fa fa fd fd fd fd fd fd
0x0c08801fc1d0: fa fa 00 00 00 00 00 01 fa fa fd fd fd fd fd fa
0x0c08801fc1e0: fa fa fd fd fd fd fd fa fa fa fd fd fd fd fd fd
0x0c08801fc1f0: fa fa 00 00 00 00 00 01 fa fa fd fd fd fd fd fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
==7309==ABORTING
With virtio-vga, pci bar are reordered. Bar #2 is used for compatibility
with stdvga. By default, bar #2 is used by virtio modern io bar.
This bar is the last one introduce in the virtio pci bar layout and it's
crushed by the virtio-vga reordering. So virtio-vga and
modern-pio-notify are incompatible because virtio-vga failed to
initialize with this option.
This fix sets the modern io bar to the bar #5 to avoid conflict.
Julia Suvorova [Mon, 27 Apr 2020 18:24:39 +0000 (20:24 +0200)]
hw/pci/pcie: Forbid hot-plug if it's disabled on the slot
Raise an error when trying to hot-plug/unplug a device through QMP to a device
with disabled hot-plug capability. This makes the device behaviour more
consistent and provides an explanation of the failure in the case of
asynchronous unplug.
Peter Maydell [Mon, 4 May 2020 12:37:17 +0000 (13:37 +0100)]
Merge remote-tracking branch 'remotes/pmaydell/tags/pull-target-arm-20200504' into staging
target-arm queue:
* Start of conversion of Neon insns to decodetree
* versal board: support SD and RTC
* Implement ARMv8.2-TTS2UXN
* Make VQDMULL undefined when U=1
* Some minor code cleanups
Peter Maydell [Thu, 30 Apr 2020 18:09:49 +0000 (19:09 +0100)]
target/arm: Move gen_ function typedefs to translate.h
We're going to want at least some of the NeonGen* typedefs
for the refactored 32-bit Neon decoder, so move them all
to translate.h since it makes more sense to keep them in
one group.
Peter Maydell [Thu, 30 Apr 2020 18:09:42 +0000 (19:09 +0100)]
target/arm: Convert Neon 3-reg-same logic ops to decodetree
Convert the Neon logic ops in the 3-reg-same grouping to decodetree.
Note that for the logic ops the 'size' field forms part of their
decode and the actual operations are always bitwise.
Peter Maydell [Thu, 30 Apr 2020 18:09:41 +0000 (19:09 +0100)]
target/arm: Convert Neon 3-reg-same VADD/VSUB to decodetree
Convert the Neon 3-reg-same VADD and VSUB insns to decodetree.
Note that we don't need the neon_3r_sizes[op] check here because all
size values are OK for VADD and VSUB; we'll add this when we convert
the first insn that has size restrictions.
For this we need one of the GVecGen*Fn typedefs currently in
translate-a64.h; move them all to translate.h as a block so they
are visible to the 32-bit decoder.
Peter Maydell [Thu, 30 Apr 2020 18:09:37 +0000 (19:09 +0100)]
target/arm: Convert VFM[AS]L (scalar) to decodetree
Convert the VFM[AS]L (scalar) insns in the 2reg-scalar-ext group
to decodetree. These are the last ones in the group so we can remove
all the legacy decode for the group.
Note that in disas_thumb2_insn() the parts of this encoding space
where the decodetree decoder returns false will correctly be directed
to illegal_op by the "(insn & (1 << 28))" check so they won't fall
into disas_coproc_insn() by mistake.
Peter Maydell [Thu, 30 Apr 2020 18:09:34 +0000 (19:09 +0100)]
target/arm: Convert VFM[AS]L (vector) to decodetree
Convert the VFM[AS]L (vector) insns to decodetree. This is the last
insn in the legacy decoder for the 3same_ext group, so we can
delete the legacy decoder function for the group entirely.
Note that in disas_thumb2_insn() the parts of this encoding space
where the decodetree decoder returns false will correctly be directed
to illegal_op by the "(insn & (1 << 28))" check so they won't fall
into disas_coproc_insn() by mistake.
Peter Maydell [Thu, 30 Apr 2020 18:09:30 +0000 (19:09 +0100)]
target/arm: Add stubs for AArch32 Neon decodetree
Add the infrastructure for building and invoking a decodetree decoder
for the AArch32 Neon encodings. At the moment the new decoder covers
nothing, so we always fall back to the existing hand-written decode.
We follow the same pattern we did for the VFP decodetree conversion
(commit 78e138bc1f672c145ef6ace74617d and following): code that deals
with Neon will be moving gradually out to translate-neon.vfp.inc,
which we #include into translate.c.
In order to share the decode files between A32 and T32, we
split Neon into 3 parts:
* data-processing
* load-store
* 'shared' encodings
The first two groups of instructions have similar but not identical
A32 and T32 encodings, so we need to manually transform the T32
encoding into the A32 one before calling the decoder; the third group
covers the Neon instructions which are identical in A32 and T32.
Peter Maydell [Thu, 30 Apr 2020 18:09:29 +0000 (19:09 +0100)]
target/arm: Don't allow Thumb Neon insns without FEATURE_NEON
We were accidentally permitting decode of Thumb Neon insns even if
the CPU didn't have the FEATURE_NEON bit set, because the feature
check was being done before the call to disas_neon_data_insn() and
disas_neon_ls_insn() in the Arm decoder but was omitted from the
Thumb decoder. Push the feature bit check down into the called
functions so it is done for both Arm and Thumb encodings.
Somewhere along theline we accidentally added a duplicate
"using D16-D31 when they don't exist" check to do_vfm_dp()
(probably an artifact of a patchseries rebase). Remove it.
target/arm: Use uint64_t for midr field in CPU state struct
MIDR_EL1 is a 64-bit system register with the top 32-bit being RES0.
Represent it in QEMU's ARMCPU struct with a uint64_t, not a
uint32_t.
This fixes an error when compiling with -Werror=conversion
because we were manipulating the register value using a
local uint64_t variable:
target/arm/cpu64.c: In function ‘aarch64_max_initfn’:
target/arm/cpu64.c:628:21: error: conversion from ‘uint64_t’ {aka ‘long unsigned int’} to ‘uint32_t’ {aka ‘unsigned int’} may change value [-Werror=conversion]
628 | cpu->midr = t;
| ^
and future-proofs us against a possible future architecture
change using some of the top 32 bits.
Peter Maydell [Thu, 23 Apr 2020 11:09:15 +0000 (12:09 +0100)]
target/arm: Use correct variable for setting 'max' cpu's ID_AA64DFR0
In aarch64_max_initfn() we update both 32-bit and 64-bit ID
registers. The intended pattern is that for 64-bit ID registers we
use FIELD_DP64 and the uint64_t 't' register, while 32-bit ID
registers use FIELD_DP32 and the uint32_t 'u' register. For
ID_AA64DFR0 we accidentally used 'u', meaning that the top 32 bits of
this 64-bit ID register would end up always zero. Luckily at the
moment that's what they should be anyway, so this bug has no visible
effects.
Peter Maydell [Mon, 30 Mar 2020 21:04:00 +0000 (22:04 +0100)]
target/arm: Implement ARMv8.2-TTS2UXN
The ARMv8.2-TTS2UXN feature extends the XN field in stage 2
translation table descriptors from just bit [54] to bits [54:53],
allowing stage 2 to control execution permissions separately for EL0
and EL1. Implement the new semantics of the XN field and enable
the feature for our 'max' CPU.
Peter Maydell [Mon, 30 Mar 2020 21:03:59 +0000 (22:03 +0100)]
target/arm: Add new 's1_is_el0' argument to get_phys_addr_lpae()
For ARMv8.2-TTS2UXN, the stage 2 page table walk wants to know
whether the stage 1 access is for EL0 or not, because whether
exec permission is given can depend on whether this is an EL0
or EL1 access. Add a new argument to get_phys_addr_lpae() so
the call sites can pass this information in.
Since get_phys_addr_lpae() doesn't already have a doc comment,
add one so we have a place to put the documentation of the
semantics of the new s1_is_el0 argument.
Peter Maydell [Mon, 30 Mar 2020 21:03:58 +0000 (22:03 +0100)]
target/arm: Use enum constant in get_phys_addr_lpae() call
The access_type argument to get_phys_addr_lpae() is an MMUAccessType;
use the enum constant MMU_DATA_LOAD rather than a literal 0 when we
call it in S1_ptw_translate().
Peter Maydell [Mon, 30 Mar 2020 21:03:57 +0000 (22:03 +0100)]
target/arm: Don't use a TLB for ARMMMUIdx_Stage2
We define ARMMMUIdx_Stage2 as being an MMU index which uses a QEMU
TLB. However we never actually use the TLB -- all stage 2 lookups
are done by direct calls to get_phys_addr_lpae() followed by a
physical address load via address_space_ld*().
Remove Stage2 from the list of ARM MMU indexes which correspond to
real core MMU indexes, and instead put it in the set of "NOTLB" ARM
MMU indexes.
This allows us to drop NB_MMU_MODES to 11. It also means we can
safely add support for the ARMv8.3-TTS2UXN extension, which adds
permission bits to the stage 2 descriptors which define execute
permission separatel for EL0 and EL1; supporting that while keeping
Stage2 in a QEMU TLB would require us to use separate TLBs for
"Stage2 for an EL0 access" and "Stage2 for an EL1 access", which is a
lot of extra complication given we aren't even using the QEMU TLB.
In the process of updating the comment on our MMU index use,
fix a couple of other minor errors:
* NS EL2 EL2&0 was missing from the list in the comment
* some text hadn't been updated from when we bumped NB_MMU_MODES
above 8
Peter Maydell [Sun, 3 May 2020 13:12:56 +0000 (14:12 +0100)]
Merge remote-tracking branch 'remotes/marcel/tags/rdma-pull-request' into staging
RDMA queue
* hw/rdma: Destroy list mutex when list is destroyed
# gpg: Signature made Sat 02 May 2020 19:42:50 BST
# gpg: using RSA key 36D4C0F0CF2FE46D
# gpg: Good signature from "Marcel Apfelbaum <[email protected]>" [unknown]
# gpg: aka "Marcel Apfelbaum <[email protected]>" [marginal]
# gpg: aka "Marcel Apfelbaum <[email protected]>" [unknown]
# gpg: WARNING: This key is not certified with sufficiently trusted signatures!
# gpg: It is not certain that the signature belongs to the owner.
# Primary key fingerprint: B1C6 3A57 F92E 08F2 640F 31F5 36D4 C0F0 CF2F E46D
* remotes/marcel/tags/rdma-pull-request:
hw/rdma: Destroy list mutex when list is destroyed
Peter Maydell [Fri, 1 May 2020 22:10:22 +0000 (23:10 +0100)]
Merge remote-tracking branch 'remotes/dgilbert-gitlab/tags/pull-virtiofs-20200501' into staging
virtiofsd: Pull 2020-05-01 (includes CVE fix)
This set includes a security fix, other fixes and improvements.
Security fix:
The security fix is for CVE-2020-10717 where, on low RAM hosts,
the guest can potentially exceed the maximum fd limit.
This fix adds some more configuration so that the user
can explicitly set the limit.
Fixes:
Recursive mounting of the exported directory is now used in
the sandbox, such that if there was a mount underneath present at
the time the virtiofsd was started, that mount is also
visible to the guest; in the existing code, only mounts that
happened after startup were visible.
Security improvements:
The jailing for /proc/self/fd is improved - but it's something
that shouldn't be accessible anyway.
Most capabilities are now dropped at startup; again this shouldn't
change any behaviour but is extra protection.
# gpg: Signature made Fri 01 May 2020 20:06:46 BST
# gpg: using RSA key 45F5C71B4A0CB7FB977A9FA90516331EBC5BFDE7
# gpg: Good signature from "Dr. David Alan Gilbert (RH2) <[email protected]>" [full]
# Primary key fingerprint: 45F5 C71B 4A0C B7FB 977A 9FA9 0516 331E BC5B FDE7
* remotes/dgilbert-gitlab/tags/pull-virtiofs-20200501:
virtiofsd: drop all capabilities in the wait parent process
virtiofsd: only retain file system capabilities
virtiofsd: Show submounts
virtiofsd: jail lo->proc_self_fd
virtiofsd: stay below fs.file-max sysctl value (CVE-2020-10717)
virtiofsd: add --rlimit-nofile=NUM option
Stefan Hajnoczi [Thu, 16 Apr 2020 16:49:06 +0000 (17:49 +0100)]
virtiofsd: only retain file system capabilities
virtiofsd runs as root but only needs a subset of root's Linux
capabilities(7). As a file server its purpose is to create and access
files on behalf of a client. It needs to be able to access files with
arbitrary uid/gid owners. It also needs to be create device nodes.
Introduce a Linux capabilities(7) whitelist and drop all capabilities
that we don't need, making the virtiofsd process less powerful than a
regular uid root process.
Note that file capabilities cannot be used to achieve the same effect on
the virtiofsd executable because mount is used during sandbox setup.
Therefore we drop capabilities programmatically at the right point
during startup.
This patch only affects the sandboxed child process. The parent process
that sits in waitpid(2) still has full root capabilities and will be
addressed in the next patch.
Max Reitz [Fri, 24 Apr 2020 13:35:16 +0000 (15:35 +0200)]
virtiofsd: Show submounts
Currently, setup_mounts() bind-mounts the shared directory without
MS_REC. This makes all submounts disappear.
Pass MS_REC so that the guest can see submounts again.
Fixes: 5baa3b8e95064c2434bd9e2f312edd5e9ae275dc Signed-off-by: Max Reitz <[email protected]>
Message-Id: <20200424133516[email protected]> Reviewed-by: Dr. David Alan Gilbert <[email protected]> Signed-off-by: Dr. David Alan Gilbert <[email protected]>
Changed Fixes to point to the commit with the problem rather than
the commit that turned it on
While it's not possible to escape the proc filesystem through
lo->proc_self_fd, it is possible to escape to the root of the proc
filesystem itself through "../..".
Use a temporary mount for opening lo->proc_self_fd, that has it's root at
/proc/self/fd/, preventing access to the ancestor directories.