Git Repo - linux.git/log

vfio/gvt: fix DRM_I915_GVT dependency on VFIO_MDEV

The Kconfig dependency is incomplete since DRM_I915_GVT is a 'bool'
symbol that depends on the 'tristate' VFIO_MDEV. This allows a
configuration with VFIO_MDEV=m, DRM_I915_GVT=y and DRM_I915=y that
causes a link failure:

x86_64-linux-ld: drivers/gpu/drm/i915/gvt/gvt.o: in function `available_instances_show':
gvt.c:(.text+0x67a): undefined reference to `mtype_get_parent_dev'
x86_64-linux-ld: gvt.c:(.text+0x6a5): undefined reference to `mtype_get_type_group_id'
x86_64-linux-ld: drivers/gpu/drm/i915/gvt/gvt.o: in function `description_show':
gvt.c:(.text+0x76e): undefined reference to `mtype_get_parent_dev'
x86_64-linux-ld: gvt.c:(.text+0x799): undefined reference to `mtype_get_type_group_id'

Clarify the dependency by specifically disallowing the broken
configuration. If VFIO_MDEV is built-in, it will work, but if
VFIO_MDEV=m, the i915 driver cannot be built-in here.
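
The tristate reasoning above can be modelled in a few lines of C. This is only a toy model of Kconfig semantics, not kernel code: the enum, the function names and the exact form of the stricter rule are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Kconfig tristate values, ordered n < m < y. */
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

static_assert(TRI_N < TRI_M && TRI_M < TRI_Y, "tristate ordering");

/*
 * Old rule: a plain "depends on VFIO_MDEV" on a bool symbol.
 * A bool symbol may be enabled whenever its dependency is not 'n',
 * so VFIO_MDEV=m is enough to permit DRM_I915_GVT=y.
 */
static bool gvt_allowed_old(enum tristate i915, enum tristate mdev)
{
	return i915 != TRI_N && mdev != TRI_N;
}

/*
 * Stricter rule (hypothetical form): VFIO_MDEV must be built-in, or be
 * built the same way as the i915 driver that contains the GVT code.
 */
static bool gvt_allowed_strict(enum tristate i915, enum tristate mdev)
{
	return i915 != TRI_N && (mdev == TRI_Y || mdev == i915);
}

/* Built-in code cannot reference symbols that only exist in a module. */
static bool links_ok(enum tristate i915, enum tristate mdev)
{
	return !(i915 == TRI_Y && mdev == TRI_M);
}
```

With i915 built-in and VFIO_MDEV modular, the old rule accepts the configuration even though it cannot link, while the stricter rule rejects it.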

Fixes: 07e543f4f9d1 ("vfio/gvt: Make DRM_I915_GVT depend on VFIO_MDEV")
Fixes: 9169cff168ff ("vfio/mdev: Correct the function signatures for the mdev_type_attributes")
Signed-off-by: Arnd Bergmann <[email protected]>
Acked-by: Zhenyu Wang <[email protected]>
Message-Id: <20210422133547.1861063 [email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/iommu_type1: Remove unused pinned_page_dirty_scope in vfio_iommu

pinned_page_dirty_scope was removed by commit 010321565a7d
("vfio/iommu_type1: Mantain a counter for non_pinned_groups"),
but reappeared due to a problem while merging branches.
We can safely remove it here.

Signed-off-by: Keqian Zhu <[email protected]>
Message-Id: <20210412024415 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Correct the function signatures for the mdev_type_attributes

The driver core standard is to pass in the properly typed object, the
properly typed attribute and the buffer data. It stems from the root
kobject method:

  ssize_t (*show)(struct kobject *kobj, struct kobj_attribute *attr,..)

Each subclass of kobject should provide their own function with the same
signature but more specific types, eg struct device uses:

  ssize_t (*show)(struct device *dev, struct device_attribute *attr,..)

In this case the existing signature is:

  ssize_t (*show)(struct kobject *kobj, struct device *dev,..)

Where kobj is a 'struct mdev_type *' and dev is 'mdev_type->parent->dev'.

Change the mdev_type related sysfs attribute functions to:

  ssize_t (*show)(struct mdev_type *mtype, struct mdev_type_attribute *attr,..)

In order to restore type safety and match the driver core standard.

There are no current users of 'attr', but if it is ever needed it would be
hard to add in retroactively, so do it now.
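
As a rough userspace sketch of the pattern described above: a generic entry point receives the base kobject and attribute, recovers the typed objects with container_of(), and calls a typed callback. The struct layouts, group_id_show() and the int return type are simplified assumptions, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Simplified userspace stand-ins for the kernel types (assumptions). */
struct kobject { const char *name; };
struct attribute { const char *name; };

struct mdev_type {
	struct kobject kobj;
	unsigned int type_group_id;
};

struct mdev_type_attribute {
	struct attribute attr;
	/* Typed callback: the subclass object and the subclass attribute. */
	int (*show)(struct mdev_type *mtype,
		    struct mdev_type_attribute *attr, char *buf);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic entry point: recover the typed objects, then call the typed hook. */
static int mdev_type_attr_show(struct kobject *kobj, struct attribute *raw,
			       char *buf)
{
	struct mdev_type *mtype = container_of(kobj, struct mdev_type, kobj);
	struct mdev_type_attribute *attr =
		container_of(raw, struct mdev_type_attribute, attr);

	assert(attr->show != NULL);
	return attr->show(mtype, attr, buf);
}

/* Example typed callback; buf is assumed to hold at least 32 bytes. */
static int group_id_show(struct mdev_type *mtype,
			 struct mdev_type_attribute *attr, char *buf)
{
	(void)attr; /* unused today, kept so the signature stays uniform */
	return snprintf(buf, 32, "%u\n", mtype->type_group_id);
}
```

The callback never sees a bare kobject, so the compiler can reject a handler written for the wrong object type.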

Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <18-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Remove kobj from mdev_parent_ops->create()

The kobj here is a type-erased version of mdev_type, which is already
stored in the struct mdev_device being passed in. It was only ever used to
compute the type_group_id, which is now extracted directly from the mdev.

Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <17-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/gvt: Use mdev_get_type_group_id()

intel_gvt_init_vgpu_type_groups() makes gvt->types 1:1 with the
supported_type_groups array, so the type_group_id is also the index into
gvt->types. Use it directly and remove the string matching.

Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <16-v2-d36939638fc6 [email protected]>
Reviewed-by: Zhenyu Wang <[email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/gvt: Make DRM_I915_GVT depend on VFIO_MDEV

At some point there may have been some reason for this weird split in this
driver, but today only the VFIO side is actually implemented.

However, it got messed up at some point and mdev code was put in gvt.c and
is pretending to be "generic" by masquerading as some generic attribute list:

static MDEV_TYPE_ATTR_RO(description);

But MDEV_TYPE attributes are only usable with mdev_device, nothing else.

Ideally all of this would be moved to kvmgt.c, but it is entangled with
the rest of the "generic" code in an odd way. Thus put in a kconfig
dependency so we don't get randconfig failures when the next patch creates
a link time dependency related to the use of MDEV_TYPE.

Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <15-v2-d36939638fc6 [email protected]>
Acked-by: Zhenyu Wang <[email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mbochs: Use mdev_get_type_group_id()

The mbochs_types array is parallel to the supported_type_groups array, so
the type_group_id indexes both. Instead of doing string searching just
directly index with type_group_id in all places.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <14-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdpy: Use mdev_get_type_group_id()

The mdpy_types array is parallel to the supported_type_groups array, so
the type_group_id indexes both. Instead of doing string searching just
directly index with type_group_id in all places.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <13-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mtty: Use mdev_get_type_group_id()

The type_group_id directly gives the single or dual port index, no
need for string searching.

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <12-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Add mdev/mtype_get_type_group_id()

This returns the index in the supported_type_groups array that is
associated with the mdev_type attached to the struct mdev_device or its
containing struct kobject.

Each mdev_device can be spawned from exactly one mdev_type, which in turn
originates from exactly one supported_type_group.

Drivers are using weird string calculations to try and get back to this
index. Providing direct access to the index removes a bunch of wonky
driver code.

mdev_type->group can be deleted as the group is obtained using the
type_group_id.
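
A minimal sketch of the difference, assuming a driver keeps a table parallel to its supported_type_groups array. The demo_types table and both lookup helpers are hypothetical and only illustrate string matching versus direct indexing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-type data, kept parallel to the type group array. */
struct demo_type {
	const char *name;
	unsigned int max_ports;
};

static const struct demo_type demo_types[] = {
	{ "1", 1 },
	{ "2", 2 },
};
#define DEMO_NTYPES (sizeof(demo_types) / sizeof(demo_types[0]))

/* Old style: recover the type by comparing names. */
static const struct demo_type *find_by_name(const char *name)
{
	for (size_t i = 0; i < DEMO_NTYPES; i++)
		if (strcmp(demo_types[i].name, name) == 0)
			return &demo_types[i];
	return NULL;
}

/* New style: the group id already is the index into the parallel array. */
static const struct demo_type *find_by_group_id(unsigned int type_group_id)
{
	assert(type_group_id < DEMO_NTYPES);
	return &demo_types[type_group_id];
}
```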

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <11-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Remove duplicate storage of parent in mdev_device

mdev_device->type->parent is the same thing.

The struct mdev_device was relying on the kref on the mdev_parent to also
indirectly hold a kref on the mdev_type pointer. Now that the type holds a
kref on the parent we can directly kref the mdev_type and remove this
implicit relationship.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <10-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Add missing error handling to dev_set_name()

This can fail, and seems to be a popular target for syzkaller error
injection. Check the error return and unwind with put_device().

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <9-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Reorganize mdev_device_create()

Once the memory for the struct mdev_device is allocated it should
immediately be device_initialize()'d and filled in so that put_device()
can always be used to undo the allocation.

Place the mdev_get/put_parent() so that they are clearly protecting the
mdev->parent pointer. Move the final put to the release function so that
the lifetime rules are trivial to understand. Update the goto labels to
follow the normal convention.

Remove mdev_device_free() as the release function via put_device() is now
usable in all cases.
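
The idea can be sketched in userspace as follows. struct demo_device and its helpers are stand-ins invented for this example; the point is only that once the object is initialized, every error path unwinds through the same put.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for struct device: a refcount plus a release hook. */
struct demo_device {
	int refs;
	char *name;
	void (*release)(struct demo_device *dev);
};

static int demo_live_objects; /* allocations not yet released */

static void demo_release(struct demo_device *dev)
{
	free(dev->name); /* free(NULL) is fine if naming never happened */
	free(dev);
	demo_live_objects--;
}

static void demo_device_initialize(struct demo_device *dev)
{
	dev->refs = 1;
	dev->release = demo_release;
}

static void demo_put_device(struct demo_device *dev)
{
	assert(dev->refs > 0);
	if (--dev->refs == 0)
		dev->release(dev);
}

/* Returns NULL on failure; every error after init unwinds via put. */
static struct demo_device *demo_device_create(const char *name, bool fail_late)
{
	size_t len = strlen(name) + 1;
	struct demo_device *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	demo_live_objects++;
	demo_device_initialize(dev); /* from here on, only put undoes it */

	dev->name = malloc(len); /* stands in for a naming step that can fail */
	if (!dev->name)
		goto out_put;
	memcpy(dev->name, name, len);
	if (fail_late) /* simulated failure in a later setup step */
		goto out_put;
	return dev;

out_put:
	demo_put_device(dev);
	return NULL;
}
```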

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <8-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Add missing reference counting to mdev_type

struct mdev_type holds a pointer to the kref'd object struct mdev_parent,
but doesn't hold the kref. The lifetime of the parent becomes implicit
because parent_remove_sysfs_files() is supposed to remove all the access
before the parent can be freed, but this is very hard to reason about.

Make it obviously correct by adding the missing get.
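
A small single-threaded sketch of what "adding the missing get" means: the object that stores the pointer also takes the reference that keeps it valid. All names here are hypothetical and plain ints replace the kernel's kref.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_parent { int refs; };
struct demo_type { struct demo_parent *parent; };

static int demo_parents_freed;

static struct demo_parent *demo_parent_alloc(void)
{
	struct demo_parent *p = calloc(1, sizeof(*p));

	if (p)
		p->refs = 1;
	return p;
}

static struct demo_parent *demo_get_parent(struct demo_parent *p)
{
	assert(p != NULL && p->refs > 0);
	p->refs++;
	return p;
}

static void demo_put_parent(struct demo_parent *p)
{
	assert(p->refs > 0);
	if (--p->refs == 0) {
		free(p);
		demo_parents_freed++;
	}
}

/* The type stores the pointer and takes the reference that backs it. */
static struct demo_type *demo_type_create(struct demo_parent *p)
{
	struct demo_type *t = calloc(1, sizeof(*t));

	if (!t)
		return NULL;
	t->parent = demo_get_parent(p);
	return t;
}

static void demo_type_release(struct demo_type *t)
{
	demo_put_parent(t->parent);
	free(t);
}
```

With the get in place, the parent outlives every type that points at it regardless of the order in which the other references are dropped.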

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <7-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Expose mdev_get/put_parent to mdev_private.h

The next patch will use these in mdev_sysfs.c

While here remove the now dead code checks for NULL, a mdev_type can never
have a NULL parent.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <6-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Use struct mdev_type in struct mdev_device

The kobj pointer in mdev_device is actually pointing at a struct
mdev_type. Use the proper type so things are understandable.

There are a number of places that are confused and passing both the mdev
and the mtype as function arguments, fix these to derive the mtype
directly from the mdev to remove the redundancy.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <5-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Simplify driver registration

This is only done once, we don't need to generate code to initialize a
structure stored in the ELF .data segment. Fill in the three required
.driver members directly instead of copying data into them during
mdev_register_driver().

Further the to_mdev_driver() function doesn't belong in a public header,
just inline it into the two places that need it. Finally, we can now
clearly see that 'drv' derived from dev->driver cannot be NULL, firstly
because the driver core forbids it, and secondly because NULL won't pass
through the container_of(). Remove the dead code.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <4-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Add missing typesafety around mdev_device

The mdev API should accept and pass a 'struct mdev_device *' in all
places, not pass a 'struct device *' and cast it internally with
to_mdev_device(). Particularly in its struct mdev_driver functions, the
whole point of a bus's struct device_driver wrapper is to provide type
safety compared to the default struct device_driver.

Further, the driver core standard is for bus drivers to expose their
device structure in their public headers that can be used with
container_of() inlines and '&foo->dev' to go between the class levels, and
'&foo->dev' to be used with dev_err/etc driver core helper functions. Move
'struct mdev_device' to mdev.h

Once done this allows moving some one-instruction exported functions to
static inlines, which in turn allows removing one of the two grotesque
symbol_get()'s related to mdev in the core code.

Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Message-Id: <3-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Do not allow a mdev_type to have a NULL parent pointer

There is a small race where the parent is NULL even though the kobj has
already been made visible in sysfs.

For instance the attribute_group is made visible in sysfs_create_files()
and the mdev_type_attr_show() does:

ret = attr->show(kobj, type->parent->dev, buf);

Which will crash on NULL parent. Move the parent setup to before the type
pointer leaves the stack frame.
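
The ordering rule can be illustrated with a single-threaded toy model. The global pointer stands in for sysfs visibility and all names are made up; real code also has to consider concurrency, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

struct demo_parent { int dev_id; };
struct demo_type { struct demo_parent *parent; };

/* Stand-in for sysfs: once a type is stored here, readers may use it. */
static struct demo_type *demo_visible_type;

/* Reader side, modelled on a show path that dereferences type->parent. */
static int demo_show(void)
{
	assert(demo_visible_type != NULL);
	return demo_visible_type->parent->dev_id;
}

/* Fill in every field a reader needs, then publish as the last step. */
static void demo_add_type(struct demo_type *type, struct demo_parent *parent)
{
	type->parent = parent;    /* set up first */
	demo_visible_type = type; /* publish last */
}
```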

Fixes: 7b96953bc640 ("vfio: Mediated device Core driver")
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <2-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Fix missing static's on MDEV_TYPE_ATTR's

These should always be prefixed with static, otherwise compilation
will fail on non-modular builds with

ld: samples/vfio-mdev/mbochs.o:(.data+0x2e0): multiple definition of `mdev_type_attr_name'; samples/vfio-mdev/mdpy.o:(.data+0x240): first defined here
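
A sketch of why the prefix matters, using a hypothetical DEMO_ATTR_RO() helper written in the spirit of MDEV_TYPE_ATTR_RO(). The macro defines a variable whose name is derived from its argument, so linkage is decided at the use site.

```c
#include <assert.h>
#include <stddef.h>

struct demo_attr { const char *name; };

/*
 * Hypothetical helper: defines a variable named demo_attr_<name>.
 * Without 'static' at the use site the variable has external linkage,
 * and two files that both define demo_attr_name collide when they are
 * linked into one image. Separate modules never meet at link time,
 * which is why only non-modular builds notice.
 */
#define DEMO_ATTR_RO(_name) \
	struct demo_attr demo_attr_##_name = { .name = #_name }

/* 'static' keeps each definition private to its translation unit. */
static DEMO_ATTR_RO(name);
static DEMO_ATTR_RO(description);

static const char *demo_attr_label(const struct demo_attr *attr)
{
	assert(attr != NULL);
	return attr->name;
}
```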

Fixes: a5e6e6505f38 ("sample: vfio bochs vbe display (host device for bochs-drm)")
Fixes: d61fc96f47fd ("sample: vfio mdev display - host device")
Signed-off-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Message-Id: <1-v2-d36939638fc6 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

Merge branches 'v5.13/vfio/embed-vfio_device', 'v5.13/vfio/misc' and 'v5.13/vfio/nvlink' into v5.13/vfio/next

Spelling fixes merged with file deletion.

Conflicts:
drivers/vfio/pci/vfio_pci_nvlink2.c

Signed-off-by: Alex Williamson <[email protected]>

vfio: Remove device_data from the vfio bus driver API

There are no longer any users, so it can go away. Everything is using
container_of now.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <14-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/pci: Replace uses of vfio_device_data() with container_of

This tidies a few confused places that think they can have a refcount on
the vfio_device while the device_data could be NULL. That isn't possible
by design.

Most of the change falls out when struct vfio_devices is updated to just
store the struct vfio_pci_device itself. This wasn't possible before
because there was no easy way to get from the 'struct vfio_pci_device' to
the 'struct vfio_device' to put back the refcount.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <13-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio: Make vfio_device_ops pass a 'struct vfio_device *' instead of 'void *'

This is the standard kernel pattern: the ops associated with a struct are
passed the struct pointer for type safety. The expected design is to use
container_of to cleanly go from the subsystem level type to the driver
level type without having any type erasure in a void *.

Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <12-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Make to_mdev_device() into a static inline

The macro wrongly uses 'dev' as both the macro argument and the member
name, which means it fails compilation if any caller uses a word other
than 'dev' as the single argument. Fix this defect by making it into a
proper static inline, which is clearer and typesafe anyhow.
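
The defect is easy to reproduce in userspace. The sketch below uses simplified stand-in structs and its own container_of(); the macro shown in the comment is an approximation of the problematic shape, not a quote of the original.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (assumptions). */
struct device { int id; };
struct mdev_device {
	int index;
	struct device dev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Problematic form, shown only as a comment:
 *   #define to_mdev_device(dev) container_of(dev, struct mdev_device, dev)
 * The preprocessor replaces every 'dev' token, including the member name,
 * so to_mdev_device(d) would ask for a member called 'd'.
 */

/* An inline function has no token capture and checks the argument type. */
static inline struct mdev_device *to_mdev_device(struct device *d)
{
	assert(d != NULL);
	return container_of(d, struct mdev_device, dev);
}
```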

Fixes: 99e3123e3d72 ("vfio-mdev: Make mdev_device private and abstract interfaces")
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <11-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Use vfio_init/register/unregister_group_dev

mdev gets little benefit because it doesn't actually do anything, however
it is the last user, so move the vfio_init/register/unregister_group_dev()
code here for now.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Liu Yi L <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <10-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/pci: Use vfio_init/register/unregister_group_dev

pci already allocates a struct vfio_pci_device with exactly the same
lifetime as vfio_device, switch to the new API and embed vfio_device in
vfio_pci_device.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Liu Yi L <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <9-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/pci: Re-order vfio_pci_probe()

vfio_add_group_dev() must be called only after all of the private data in
vdev is fully setup and ready, otherwise there could be races with user
space instantiating a device file descriptor and starting to call ops.

For instance vfio_pci_reflck_attach() sets vdev->reflck and
vfio_pci_open(), called by fops open, unconditionally derefs it, which
will crash if things get out of order.

Fixes: cc20d7999000 ("vfio/pci: Introduce VF token")
Fixes: e309df5b0c9e ("vfio/pci: Parallelize device open and release")
Fixes: 6eb7018705de ("vfio-pci: Move idle devices to D3hot power state")
Fixes: ecaa1f6a0154 ("vfio-pci: Add VGA arbiter client")
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <8-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/pci: Move VGA and VF initialization to functions

vfio_pci_probe() is quite complicated, with optional VF and VGA sub
components. Move these into clear init/uninit functions and have a linear
flow in probe/remove.

This fixes a few little buglets:
- vfio_pci_remove() is in the wrong order, vga_client_register() removes
a notifier and is after kfree(vdev), but the notifier refers to vdev,
so it can use after free in a race.
- vga_client_register() can fail but was ignored

Organize things so destruction order is the reverse of creation order.

Fixes: ecaa1f6a0154 ("vfio-pci: Add VGA arbiter client")
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <7-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/fsl-mc: Use vfio_init/register/unregister_group_dev

fsl-mc already allocates a struct vfio_fsl_mc_device with exactly the same
lifetime as vfio_device, switch to the new API and embed vfio_device in
vfio_fsl_mc_device. While here remove the devm usage for the vdev, this
code is clean and doesn't need devm.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <6-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/fsl-mc: Re-order vfio_fsl_mc_probe()

vfio_add_group_dev() must be called only after all of the private data in
vdev is fully setup and ready, otherwise there could be races with user
space instantiating a device file descriptor and starting to call ops.

For instance vfio_fsl_mc_reflck_attach() sets vdev->reflck and
vfio_fsl_mc_open(), called by fops open, unconditionally derefs it, which
will crash if things get out of order.

This driver started life with the right sequence, but two commits added
stuff after vfio_add_group_dev().

Fixes: 2e0d29561f59 ("vfio/fsl-mc: Add irq infrastructure for fsl-mc devices")
Fixes: f2ba7e8c947b ("vfio/fsl-mc: Added lock support in preparation for interrupt handling")
Co-developed-by: Diana Craciun OSS <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <5-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/platform: Use vfio_init/register/unregister_group_dev

platform already allocates a struct vfio_platform_device with exactly
the same lifetime as vfio_device, switch to the new API and embed
vfio_device in vfio_platform_device.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Acked-by: Eric Auger <[email protected]>
Tested-by: Eric Auger <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <4-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio: Split creation of a vfio_device into init and register ops

This makes the struct vfio_device part of the public interface so it
can be used with container_of and so forth, as is typical for a Linux
subsystem.

This is the first step to bring some type-safety to the vfio interface by
allowing the replacement of 'void *' and 'struct device *' inputs with a
simple and clear 'struct vfio_device *'

For now the self-allocating vfio_add_group_dev() interface is kept so each
user can be updated as a separate patch.

The expected usage pattern is

  driver core probe() function:
     my_device = kzalloc(sizeof(*my_device), GFP_KERNEL);
     vfio_init_group_dev(&my_device->vdev, dev, ops, my_device);
     /* other driver specific prep */
     vfio_register_group_dev(&my_device->vdev);
     dev_set_drvdata(dev, my_device);

  driver core remove() function:
     my_device = dev_get_drvdata(dev);
     vfio_unregister_group_dev(&my_device->vdev);
     /* other driver specific tear down */
     kfree(my_device);

Allowing the driver to be able to use the drvdata and vfio_device to go
to/from its own data.

The pattern also makes it clear that vfio_register_group_dev() must be
last in the sequence, as once it is called the core code can immediately
start calling ops. The init/register gap is provided to allow for the
driver to do setup before ops can be called and thus avoid races.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Liu Yi L <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <3-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio: Simplify the lifetime logic for vfio_device

The vfio_device is using a 'sleep until all refs go to zero' pattern for
its lifetime, but it is indirectly coded by repeatedly scanning the group
list waiting for the device to be removed on its own.

Switch this around to be a direct representation, use a refcount to count
the number of places that are blocking destruction and sleep directly on a
completion until that counter goes to zero. kfree the device after other
accesses have been excluded in vfio_del_group_dev(). This is a fairly
common Linux idiom.

Due to this we can now remove kref_put_mutex(), which is very rarely used
in the kernel. Here it is being used to prevent a zero ref device from
being seen in the group list. Instead allow the zero ref device to
continue to exist in the device_list and use refcount_inc_not_zero() to
exclude it once refs go to zero.
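
The "take a reference only if the count is still non-zero" step can be sketched with C11 atomics. This is a userspace approximation with invented names, not the kernel's refcount_t; in the scheme described above, the put that observes zero would be the one that signals the completion the remover sleeps on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_ref { atomic_uint count; };

static void demo_ref_init(struct demo_ref *r)
{
	atomic_init(&r->count, 1);
}

/* Take a reference only if the object is not already on its way out. */
static bool demo_ref_inc_not_zero(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; true means this caller saw the count reach zero. */
static bool demo_ref_put(struct demo_ref *r)
{
	unsigned int old = atomic_fetch_sub(&r->count, 1);

	assert(old > 0);
	return old == 1;
}
```

A lookup that walks the device list can call the inc-not-zero helper and simply skip entries for which it fails, so a zero-ref device may stay on the list until it is removed.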

This patch is organized so the next patch will be able to alter the API to
allow drivers to provide the kfree.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <2-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio: Remove extra put/gets around vfio_device->group

The vfio_device->group value has a get obtained during
vfio_add_group_dev() which gets moved from the stack to vfio_device->group
in vfio_group_create_device().

The reference remains until we reach the end of vfio_del_group_dev() when
it is put back.

Thus anything that already has a kref on the vfio_device is guaranteed a
valid group pointer. Remove all the extra reference traffic.

It is tricky to see, but the get at the start of vfio_del_group_dev() is
actually pairing with the put hidden inside vfio_device_put() a few lines
below.

A later patch merges vfio_group_create_device() into vfio_add_group_dev()
which makes the ownership and error flow on the create side easier to
follow.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Max Gurtovoy <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
Message-Id: <1-v3-225de1400dfc [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/pci: remove vfio_pci_nvlink2

This driver never had any open userspace (which for VFIO would include
VM kernel drivers) that use it, and thus should never have been added
by our normal userspace ABI rules.

Signed-off-by: Christoph Hellwig <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Message-Id: <20210326061311.1497642 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/type1: Remove the almost unused check in vfio_iommu_type1_unpin_pages

The check i > npage at the end of vfio_iommu_type1_unpin_pages is unused
unless npage < 0, but if npage < 0, this function will return npage when
it should return -EINVAL instead. So let's just check the parameter npage
at the start of the function. While here, replace unpin_exit with break.
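
A simplified sketch of the resulting shape: validate the argument once at the top, then let the loop break out early. demo_unpin_one() and the return convention are assumptions for illustration and do not mirror the real function body.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-page helper: pretend odd pfns are not pinned. */
static int demo_unpin_one(unsigned long pfn)
{
	return (pfn & 1) ? -1 : 0;
}

/* Returns the number of pages unpinned, or -EINVAL for bad arguments. */
static int demo_unpin_pages(const unsigned long *pfns, int npage)
{
	int i;

	/* Reject bad input up front instead of patching the result later. */
	if (!pfns || npage <= 0)
		return -EINVAL;

	for (i = 0; i < npage; i++) {
		if (demo_unpin_one(pfns[i]) < 0)
			break; /* stop at the first page that is not pinned */
	}

	assert(i <= npage); /* the loop can never overshoot npage */
	return i;
}
```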

Signed-off-by: Shenming Lu <[email protected]>
Message-Id: <20210406135009 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/platform: Fix spelling mistake "registe" -> "register"

There is a spelling mistake in a comment, fix it.

Signed-off-by: Zhen Lei <[email protected]>
Acked-by: Eric Auger <[email protected]>
Message-Id: <20210326083528 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/pci: fix a couple of spelling mistakes

There are several spelling mistakes, as follows:
thru ==> through
presense ==> presence

Signed-off-by: Zhen Lei <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Message-Id: <20210326083528 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/mdev: Fix spelling mistake "interal" -> "internal"

There is a spelling mistake in a comment, fix it.

Signed-off-by: Zhen Lei <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Message-Id: <20210326083528 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/type1: fix a couple of spelling mistakes

There are several spelling mistakes, as follows:
userpsace ==> userspace
Accouting ==> Accounting
exlude ==> exclude

Signed-off-by: Zhen Lei <[email protected]>
Reviewed-by: Eric Auger <[email protected]>
Message-Id: <20210326083528 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/pci: Add support for opregion v2.1+

Before opregion version 2.0, VBT data is stored in opregion mailbox #4.
From opregion v2.0 onward, when the VBT data exceeds 6KB and no longer
fits in mailbox #4, an Extended VBT region next to the opregion holds
the VBT data, so the total size is the opregion size plus the extended
VBT region size.

Opregion v2.0, which uses a physical host VBT address, is not supported:
it is not practically available to end users, and the guest cannot
directly access host physical addresses.

Cc: Zhenyu Wang <[email protected]>
Signed-off-by: Swee Yee Fonn <[email protected]>
Signed-off-by: Fred Gao <[email protected]>
Message-Id: <20210325170953 [email protected]>
Reviewed-by: Zhenyu Wang <[email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio/pci: Remove an unnecessary blank line in vfio_pci_enable

This blank line is unnecessary, so remove it.

Signed-off-by: Zhou Wang <[email protected]>
Message-Id: <1615808073 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

vfio: pci: Spello fix in the file vfio_pci.c

s/permision/permission/

Signed-off-by: Bhaskar Chowdhury <[email protected]>
Acked-by: Randy Dunlap <[email protected]>
Message-Id: <20210314052925 [email protected]>
Signed-off-by: Alex Williamson <[email protected]>

Linux 5.12-rc6

firewire: nosy: Fix a use-after-free bug in nosy_ioctl()

For each device, the nosy driver allocates a pcilynx structure.
A use-after-free might happen in the following scenario:

1. Open nosy device for the first time and call ioctl with command
    NOSY_IOC_START, then a new client A will be malloced and added to
    doubly linked list.
2. Open nosy device for the second time and call ioctl with command
    NOSY_IOC_START, then a new client B will be malloced and added to
    doubly linked list.
3. Call ioctl with command NOSY_IOC_START for client A, then client A
    will be readded to the doubly linked list. Now the doubly linked
    list is messed up.
4. Close the first nosy device and nosy_release will be called. In
    nosy_release, client A will be unlinked and freed.
5. Close the second nosy device, and client A will be referenced,
    resulting in UAF.

The root cause of this bug is that the element in the doubly linked list
is reentered into the list.

Fix this bug by adding a check before inserting a client.  If a client
is already in the linked list, don't insert it.
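
A userspace sketch of the guard, with a minimal circular doubly linked list written for this example. The helper names are invented, and locking is omitted here even though the real check has to run under the lock that protects the list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_is_empty(const struct list_head *h)
{
	return h->next == h;
}

static void list_add_tail_entry(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Unlink and re-initialise, so the entry reads as "not on any list". */
static void list_del_init_entry(struct list_head *entry)
{
	assert(entry->next != NULL && entry->prev != NULL);
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	list_init(entry);
}

struct demo_client { struct list_head link; };

/* Guarded start: inserting an already-linked entry would corrupt the list. */
static bool demo_client_start(struct demo_client *c, struct list_head *clients)
{
	if (!list_is_empty(&c->link))
		return false; /* already started, do nothing */
	list_add_tail_entry(&c->link, clients);
	return true;
}

static unsigned int demo_count(const struct list_head *head)
{
	unsigned int n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}
```

Starting client A a second time is now a no-op, so releasing A later leaves the list with exactly B on it.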

The following KASAN report reveals it:

   BUG: KASAN: use-after-free in nosy_release+0x1ea/0x210
   Write of size 8 at addr ffff888102ad7360 by task poc
   CPU: 3 PID: 337 Comm: poc Not tainted 5.12.0-rc5+ #6
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
   Call Trace:
     nosy_release+0x1ea/0x210
     __fput+0x1e2/0x840
     task_work_run+0xe8/0x180
     exit_to_user_mode_prepare+0x114/0x120
     syscall_exit_to_user_mode+0x1d/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   Allocated by task 337:
     nosy_open+0x154/0x4d0
     misc_open+0x2ec/0x410
     chrdev_open+0x20d/0x5a0
     do_dentry_open+0x40f/0xe80
     path_openat+0x1cf9/0x37b0
     do_filp_open+0x16d/0x390
     do_sys_openat2+0x11d/0x360
     __x64_sys_open+0xfd/0x1a0
     do_syscall_64+0x33/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   Freed by task 337:
     kfree+0x8f/0x210
     nosy_release+0x158/0x210
     __fput+0x1e2/0x840
     task_work_run+0xe8/0x180
     exit_to_user_mode_prepare+0x114/0x120
     syscall_exit_to_user_mode+0x1d/0x40
     entry_SYSCALL_64_after_hwframe+0x44/0xae

   The buggy address belongs to the object at ffff888102ad7300 which belongs to the cache kmalloc-128 of size 128
   The buggy address is located 96 bytes inside of 128-byte region [ffff888102ad7300, ffff888102ad7380)

[ Modified to use 'list_empty()' inside proper lock  - Linus ]
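
The list_empty() check described above can be sketched in user-space C.
This is a minimal illustration, not the driver's code: the list helpers
are simplified stand-ins for the kernel's list_head API, struct client
and client_start() are hypothetical names, and the locking that the
note above calls for is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
    struct list_node *prev, *next;
};

/* A node that points to itself is "empty", i.e. not on any list. */
static void list_init(struct list_node *n)
{
    n->prev = n;
    n->next = n;
}

static bool list_is_empty(const struct list_node *n)
{
    return n->next == n;
}

static void list_add_tail_node(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink and re-initialise, so list_is_empty() is true again afterwards. */
static void list_del_init_node(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

static size_t list_length(const struct list_node *head)
{
    size_t n = 0;
    for (const struct list_node *p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* Hypothetical client object; only the link field matters here. */
struct client {
    struct list_node link;
};

/*
 * Insert the client only if it is not already linked.
 * Returns true if it was inserted, false if the request was ignored.
 */
static bool client_start(struct client *c, struct list_node *head)
{
    if (!list_is_empty(&c->link))
        return false;
    list_add_tail_node(&c->link, head);
    return true;
}
```

The check only works if every client link is initialised before first
use and unlinked with the re-initialising delete, so that an unlinked
node always points back at itself.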

Link: https://lore.kernel.org/lkml/[email protected]/
Reported-and-tested-by: 马哲宇 (Zheyu Ma) <[email protected]>
Signed-off-by: Zheyu Ma <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Stefan Richter <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'for-linus' of git://github.com/openrisc/linux

Pull OpenRISC fix from Stafford Horne:
"Fix duplicate header include in Litex SOC driver"

* tag 'for-linus' of git://github.com/openrisc/linux:
soc: litex: Remove duplicated header file inclusion

Merge tag 'io_uring-5.12-2021-04-03' of git://git.kernel.dk/linux-block

Pull io_uring fix from Jens Axboe:
"Just fixing a silly braino in a previous patch, where we'd end up
  failing to compile if CONFIG_BLOCK isn't enabled.

  Not that a lot of people do that, but kernel bot spotted it and it's
  probably prudent to just flush this out now before -rc6.

  Sorry about that, none of my test compile configs have !CONFIG_BLOCK"

* tag 'io_uring-5.12-2021-04-03' of git://git.kernel.dk/linux-block:
  io_uring: fix !CONFIG_BLOCK compilation failure

soc: litex: Remove duplicated header file inclusion

The header file <linux/errno.h> is already included above and can be
removed here.

Signed-off-by: Zhen Lei <[email protected]>
Signed-off-by: Mateusz Holenko <[email protected]>
Signed-off-by: Stafford Horne <[email protected]>

Merge tag 'gfs2-v5.12-rc2-fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fixes from Andreas Gruenbacher:
"Two more gfs2 fixes"

* tag 'gfs2-v5.12-rc2-fixes2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: report "already frozen/thawed" errors
gfs2: Flag a withdraw if init_threads() fails

Merge tag 'riscv-for-linus-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:
"A handful of fixes for 5.12:

   - fix a stack tracing regression related to "const register asm"
     variables, which have unexpected behavior.

   - ensure the value to be written by put_user() is evaluated before
     enabling access to userspace memory.

   - align the exception vector table correctly, so we don't rely on the
     firmware's handling of unaligned accesses.

   - build fix to make NUMA depend on MMU, which triggered on some
     randconfigs"

* tag 'riscv-for-linus-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Make NUMA depend on MMU
  riscv: remove unneeded semicolon
  riscv,entry: fix misaligned base for excp_vect_table
  riscv: evaluate put_user() arg before enabling user access
  riscv: Drop const annotation for sp

Merge tag 'powerpc-5.12-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:
"Fix a bug on pseries where spurious wakeups from H_PROD would prevent
  partition migration from succeeding.

  Fix oopses seen in pcpu_alloc(), caused by parallel faults of the
  percpu mapping causing us to corrupt the protection key used for the
  mapping, and cause a fatal key fault.

  Thanks to Aneesh Kumar K.V, Murilo Opsfelder Araujo, and Nathan Lynch"

* tag 'powerpc-5.12-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/mm/book3s64: Use the correct storage key value when calling H_PROTECT
  powerpc/pseries/mobility: handle premature return from H_JOIN
  powerpc/pseries/mobility: use struct for shared state

Merge tag 'hyperv-fixes-signed-20210402' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux

Pull Hyper-V fixes from Wei Liu:
"One fix from Lu Yunlong for a double free in hvfb_probe"

* tag 'hyperv-fixes-signed-20210402' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
video: hyperv_fb: Fix a double free in hvfb_probe

Merge tag 'driver-core-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fix from Greg KH:
"Here is a single driver core fix for a reported problem with deferred
  probing. It has been in linux-next for a while with no reported
  problems"

* tag 'driver-core-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: clear deferred probe reason on probe retry

Merge tag 'char-misc-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
"Here are a few small driver char/misc changes for 5.12-rc6.

  Nothing major here, a few fixes for reported issues:

   - interconnect fixes for problems found

   - fbcon syzbot-found fix

   - extcon fixes

   - firmware stratix10 bugfix

   - MAINTAINERS file update.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'char-misc-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  drivers: video: fbcon: fix NULL dereference in fbcon_cursor()
  mei: allow map and unmap of client dma buffer only for disconnected client
  MAINTAINERS: Add linux-phy list and patchwork
  interconnect: Fix kerneldoc warning
  firmware: stratix10-svc: reset COMMAND_RECONFIG_FLAG_PARTIAL to 0
  extcon: Fix error handling in extcon_dev_register
  extcon: Add stubs for extcon_register_notifier_all() functions
  interconnect: core: fix error return code of icc_link_destroy()
  interconnect: qcom: msm8939: remove rpm-ids from non-RPM nodes

Merge tag 'staging-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
"Here are two rtl8192e staging driver fixes for reported problems.

  Both of these have been in linux-next for a while with no reported
  issues"

* tag 'staging-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: rtl8192e: Change state information from u16 to u8
  staging: rtl8192e: Fix incorrect source in memcpy()

Merge tag 'tty-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull serial driver fix from Greg KH:
"Here is a single serial driver fix for 5.12-rc6. It is a revert of a
  change that showed up in 5.9 that has been reported to cause problems.

  It has been in linux-next for a while with no reported issues"

* tag 'tty-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  soc: qcom-geni-se: Cleanup the code to remove proxy votes

Merge tag 'usb-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are a few small USB driver fixes for 5.12-rc6 to resolve reported
  problems.

  They include:

   - a number of cdc-acm fixes for reported problems. It seems more
     people are using this driver lately...

   - dwc3 driver fixes for reported problems, and fixes for the fixes :)

   - dwc2 driver fixes for reported issues.

   - musb driver fix.

   - new USB quirk additions.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'usb-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (23 commits)
  usb: dwc2: Prevent core suspend when port connection flag is 0
  usb: dwc2: Fix HPRT0.PrtSusp bit setting for HiKey 960 board.
  usb: musb: Fix suspend with devices connected for a64
  usb: xhci-mtk: fix broken streams issue on 0.96 xHCI
  usb: dwc3: gadget: Clear DEP flags after stop transfers in ep disable
  usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control()
  USB: quirks: ignore remote wake-up on Fibocom L850-GL LTE modem
  USB: cdc-acm: do not log successful probe on later errors
  USB: cdc-acm: always claim data interface
  USB: cdc-acm: use negation for NULL checks
  USB: cdc-acm: clean up probe error labels
  USB: cdc-acm: drop redundant driver-data reset
  USB: cdc-acm: drop redundant driver-data assignment
  USB: cdc-acm: fix use-after-free after probe failure
  USB: cdc-acm: fix double free on probe failure
  USB: cdc-acm: downgrade message to debug
  USB: cdc-acm: untangle a circular dependency between callback and softint
  cdc-acm: fix BREAK rx code path adding necessary calls
  usb: gadget: udc: amd5536udc_pci fix null-ptr-dereference
  usb: dwc3: pci: Enable dis_uX_susphy_quirk for Intel Merrifield
  ...

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
"A single fix to iscsi for a rare race condition which can cause a
kernel panic"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: iscsi: Fix race condition between login and sync thread

io_uring: fix !CONFIG_BLOCK compilation failure

kernel test robot correctly pinpoints a compilation failure if
CONFIG_BLOCK isn't set:

fs/io_uring.c: In function '__io_complete_rw':
>> fs/io_uring.c:2509:48: error: implicit declaration of function 'io_rw_should_reissue'; did you mean 'io_rw_reissue'? [-Werror=implicit-function-declaration]
    2509 |  if ((res == -EAGAIN || res == -EOPNOTSUPP) && io_rw_should_reissue(req)) {
         |                                                ^~~~~~~~~~~~~~~~~~~~
         |                                                io_rw_reissue
    cc1: some warnings being treated as errors

Ensure that we have a stub declaration of io_rw_should_reissue() for
!CONFIG_BLOCK.
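
The stub pattern described above can be sketched as follows. This is an
illustration under assumptions, not the actual patch: struct
io_kiocb_sketch, its retries field, the retry limit in the CONFIG_BLOCK
branch and complete_rw_needs_reissue() are all hypothetical. Only the
shape matters: the caller compiles in both configurations because a
trivial definition exists when CONFIG_BLOCK is not set.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the request structure. */
struct io_kiocb_sketch {
    int retries;
};

#ifdef CONFIG_BLOCK
/* Placeholder policy for illustration only; the real check differs. */
static bool io_rw_should_reissue(struct io_kiocb_sketch *req)
{
    return req->retries < 3;
}
#else
/* Stub: without block support there is never anything to reissue. */
static bool io_rw_should_reissue(struct io_kiocb_sketch *req)
{
    (void)req;
    return false;
}
#endif

/* Caller modelled on the condition quoted in the compiler error above. */
static bool complete_rw_needs_reissue(struct io_kiocb_sketch *req, int res)
{
    return (res == -EAGAIN || res == -EOPNOTSUPP) && io_rw_should_reissue(req);
}
```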

Fixes: 230d50d448ac ("io_uring: move reissue into regular IO path")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'block-5.12-2021-04-02' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

- Remove comment that never came to fruition in 22 years of development
   (Christoph)

- Remove unused request flag (Christoph)

- Fix for null_blk fake timeout handling (Damien)

- Fix for IOCB_NOWAIT being ignored for O_DIRECT on raw bdevs (Pavel)

- Error propagation fix for multiple split bios (Yufen)

* tag 'block-5.12-2021-04-02' of git://git.kernel.dk/linux-block:
  block: remove the unused RQF_ALLOCED flag
  block: update a few comments in uapi/linux/blkpg.h
  block: don't ignore REQ_NOWAIT for direct IO
  null_blk: fix command timeout completion handling
  block: only update parent bi_status when bio fail

Merge tag 'io_uring-5.12-2021-04-02' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
"Nothing really major in here, and finally nothing really related to
  signals. A few minor fixups related to the threading changes, and some
  general fixes, that's it.

  There's the pending gdb-get-confused-about-arch, but that's more of a
  cosmetic issue, nothing that hinders use of it. And given that other
  archs will likely be affected by that oddity too, better to postpone
  any changes there until 5.13 imho"

* tag 'io_uring-5.12-2021-04-02' of git://git.kernel.dk/linux-block:
  io_uring: move reissue into regular IO path
  io_uring: fix EIOCBQUEUED iter revert
  io_uring/io-wq: protect against sprintf overflow
  io_uring: don't mark S_ISBLK async work as unbounded
  io_uring: drop sqd lock before handling signals for SQPOLL
  io_uring: handle setup-failed ctx in kill_timeouts
  io_uring: always go for cancellation spin on exec

Merge tag 'acpi-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fixes from Rafael Wysocki:
"These fix an ACPI tables management issue, an issue related to the
  ACPI enumeration of devices and CPU wakeup in the ACPI processor
  driver.

  Specifics:

   - Ensure that the memory occupied by ACPI tables on x86 will always
     be reserved to prevent it from being allocated for other purposes
     which was possible in some cases (Rafael Wysocki).

   - Fix the ACPI device enumeration code to prevent it from attempting
     to evaluate the _STA control method for devices with unmet
     dependencies which is likely to fail (Hans de Goede).

   - Fix the handling of CPU0 wakeup in the ACPI processor driver to
     prevent CPU0 online failures from occurring (Vitaly Kuznetsov)"

* tag 'acpi-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()
  ACPI: scan: Fix _STA getting called on devices with unmet dependencies
  ACPI: tables: x86: Reserve memory occupied by ACPI tables

Merge tag 'pm-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix a race condition and an ordering issue related to using
  device links in the runtime PM framework and two kerneldoc comments in
  cpufreq.

  Specifics:

   - Fix race condition related to the handling of supplier devices
     during consumer device probe and fix the order of decrementation of
     two related reference counters in the runtime PM core code handling
     supplier devices (Adrian Hunter).

   - Fix kerneldoc comments in cpufreq that have not been updated along
     with the functions documented by them (Geert Uytterhoeven)"

* tag 'pm-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: runtime: Fix race getting/putting suppliers at probe
  PM: runtime: Fix ordering in pm_runtime_get_suppliers()
  cpufreq: Fix scaling_{available,boost}_frequencies_show() comments

block: remove the unused RQF_ALLOCED flag

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: update a few comments in uapi/linux/blkpg.h

The big top of the file comment talks about grand plans that never
happened, so remove them to not confuse the readers. Also mark the
devname and volname fields as ignored as they were never used by the
kernel.

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

Merge tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
"Fix stack trace entry size to stop showing garbage

  The macro that creates both the structure and the format displayed to
  user space for the stack trace event was changed a while ago to fix
  the parsing by user space tooling. But this change also modified the
  structure used to store the stack trace event. It changed the caller
  array field from [0] to [8].

  Even though the size in the ring buffer is dynamic and can be
  something other than 8 (user space knows how to handle this), the 8
  extra words were not accounted for when reserving the event on the ring
  buffer, and added 8 more entries, due to the calculation of
  "sizeof(*entry) + nr_entries * sizeof(long)", as the sizeof(*entry)
  now contains 8 entries.

  The size of the caller field needs to be subtracted from the size of
  the entry to create the correct allocation size"

* tag 'trace-v5.12-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix stack trace event size

io_uring: move reissue into regular IO path

It's non-obvious how retry is done for block backed files, when it happens
off the kiocb done path. It also makes it tricky to deal with the iov_iter
handling.

Just mark the req as needing a reissue, and handle it from the
submission path instead. This makes it directly obvious that we're not
re-importing the iovec from userspace past the submit point, and it means
that we can just reuse our usual -EAGAIN retry path from the read/write
handling.

At some point in the future, we'll gain the ability to always reliably
return -EAGAIN through the stack. A previous attempt on the block side
didn't pan out and got reverted, hence the need to check for this
information out-of-band right now.

Signed-off-by: Jens Axboe <[email protected]>

Merge branches 'acpi-tables' and 'acpi-scan'

* acpi-tables:
ACPI: tables: x86: Reserve memory occupied by ACPI tables

* acpi-scan:
ACPI: scan: Fix _STA getting called on devices with unmet dependencies

Merge branch 'pm-cpufreq'

* pm-cpufreq:
cpufreq: Fix scaling_{available,boost}_frequencies_show() comments

block: don't ignore REQ_NOWAIT for direct IO

If IOCB_NOWAIT is set on submission, then that needs to get propagated to
REQ_NOWAIT on the block side. Otherwise we completely lose this
information, and any issuer of IOCB_NOWAIT IO will potentially end up
blocking on eg request allocation on the storage side.

Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

riscv: Make NUMA depend on MMU

NUMA is useless on NOMMU and leads to build errors, so
make it depend on MMU.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: remove unneeded semicolon

Eliminate the following coccicheck warning:
./arch/riscv/mm/kasan_init.c:219:2-3: Unneeded semicolon

Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv,entry: fix misaligned base for excp_vect_table

In RV64, the size of each entry in excp_vect_table is 8 bytes. If the
base of the table is not 8-byte aligned, loading an entry in the table
will raise a misaligned exception. Although such an exception will be
handled by opensbi/bbl, this still causes performance degradation.

Signed-off-by: Zihao Yu <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: evaluate put_user() arg before enabling user access

The <asm/uaccess.h> header has a problem with put_user(a, ptr) if
'a' is not a simple variable, for example a function call. This can lead
to the compiler producing code like so:

1: enable_user_access()
2: evaluate 'a' into register 'r'
3: put 'r' to 'ptr'
4: disable_user_access()

The issue is that 'a' is now being evaluated with the user memory
protections disabled. So we try and force the evaluation by assigning
'x' to __val at the start, and hoping the compiler barriers in
enable_user_access() do the job of ordering step 2 before step 1.

This has shown up in a bug where 'a' sleeps and thus schedules out
and loses the SR_SUM flag. This isn't sufficient to fully fix, but
should reduce the window of opportunity. The first instance of this
we found is in schedule_tail() where the code does:

$ less -N kernel/sched/core.c

4263 if (current->set_child_tid)
4264 put_user(task_pid_vnr(current), current->set_child_tid);

Here, the task_pid_vnr(current) is called within the block that has
enabled the user memory access. This can be made worse with KASAN
which makes task_pid_vnr() a rather large call with plenty of
opportunity to sleep.
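
The ordering problem and the approach described above can be modelled in
user-space C. This is a sketch under assumptions, not the riscv
implementation: a global flag stands in for the SR_SUM state,
expensive_value() stands in for an argument such as
task_pid_vnr(current), and both macro names are hypothetical. In this
single-threaded model the ordering is guaranteed by the language; the
kernel version depends on compiler barriers, as the text notes.

```c
#include <stdbool.h>

/* Models whether user memory access is currently enabled. */
static bool user_access_enabled;
/* Set if the argument was evaluated inside the enabled window. */
static bool evaluated_while_enabled;

static void enable_user_access(void)  { user_access_enabled = true; }
static void disable_user_access(void) { user_access_enabled = false; }

/* Stand-in for a non-trivial argument expression. */
static int expensive_value(void)
{
    if (user_access_enabled)
        evaluated_while_enabled = true;
    return 42;
}

/* Problematic shape: 'x' is evaluated after access has been enabled. */
#define put_user_unsafe(x, ptr)              \
    do {                                     \
        enable_user_access();                \
        *(ptr) = (x);                        \
        disable_user_access();               \
    } while (0)

/* Sketch of the approach: evaluate 'x' into a local first, then open the window. */
#define put_user_sketch(x, ptr)              \
    do {                                     \
        __typeof__(*(ptr)) val_ = (x);       \
        enable_user_access();                \
        *(ptr) = val_;                       \
        disable_user_access();               \
    } while (0)
```

Calling put_user_unsafe(expensive_value(), &slot) sets
evaluated_while_enabled, while put_user_sketch() with the same arguments
leaves it clear.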

Signed-off-by: Ben Dooks <[email protected]>
Reported-by: [email protected]
Suggested-by: Arnd Bergman <[email protected]>
--
Changes since v1:
- fixed formatting and updated the patch description with more info

Changes since v2:
- fixed commenting on __put_user() ([email protected])

Change since v3:
- fixed RFC in patch title. Should be ready to merge.

Signed-off-by: Palmer Dabbelt <[email protected]>

riscv: Drop const annotation for sp

The const annotation should not be used for 'sp', or it will
become read only and lead to bad stack output.

Fixes: dec822771b01 ("riscv: stacktrace: Move register keyword to beginning of declaration")
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Palmer Dabbelt <[email protected]>

Merge tag 'lto-v5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull LTO fix from Kees Cook:
"It seems that there is a bug in ld.bfd when doing module section
  merging.

  As explicit merging is only needed for LTO, the work-around is to only
  do it under LTO, leaving the original section layout choices alone
  under normal builds:

   - Only perform explicit module section merges under LTO (Sean
     Christopherson)"

* tag 'lto-v5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  kbuild: lto: Merge module sections if and only if CONFIG_LTO_CLANG is enabled

kbuild: lto: Merge module sections if and only if CONFIG_LTO_CLANG is enabled

Merge module sections only when using Clang LTO. With ld.bfd, merging
sections does not appear to update the symbol tables for the module,
e.g. 'readelf -s' shows the value that a symbol would have had, if
sections were not merged. ld.lld does not show this problem.

The stale symbol table breaks gdb's function disassembler, and presumably
other things, e.g.

gdb -batch -ex "file arch/x86/kvm/kvm.ko" -ex "disassemble kvm_init"

reads the wrong bytes and dumps garbage.

Fixes: dd2776222abb ("kbuild: lto: merge module sections")
Cc: Nick Desaulniers <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Sami Tolvanen <[email protected]>
Tested-by: Sami Tolvanen <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"It's a bit larger than I (and probably you) would like by the time we
  get to -rc6, but perhaps not entirely unexpected since the changes in
  the last merge window were larger than usual.

  x86:
   - Fixes for missing TLB flushes with TDP MMU

   - Fixes for race conditions in nested SVM

   - Fixes for lockdep splat with Xen emulation

   - Fix for kvmclock underflow

   - Fix srcdir != builddir builds

   - Other small cleanups

  ARM:
   - Fix GICv3 MMIO compatibility probing

   - Prevent guests from using the ARMv8.4 self-hosted tracing
     extension"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0)
  KVM: x86: Prevent 'hv_clock->system_time' from going negative in kvm_guest_time_update()
  KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken
  KVM: x86: reduce pvclock_gtod_sync_lock critical sections
  KVM: SVM: ensure that EFER.SVME is set when running nested guest or on nested vmexit
  KVM: SVM: load control fields from VMCB12 before checking them
  KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
  KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
  KVM: make: Fix out-of-source module builds
  selftests: kvm: make hardware_disable_test less verbose
  KVM: x86/vPMU: Forbid writing to MSR_F15H_PERF MSRs when guest doesn't have X86_FEATURE_PERFCTR_CORE
  KVM: x86: remove unused declaration of kvm_write_tsc()
  KVM: clean up the unused argument
  tools/kvm_stat: Add restart delay
  KVM: arm64: Fix CPU interface MMIO compatibility detection
  KVM: arm64: Disable guest access to trace filter controls
  KVM: arm64: Hide system instruction access to Trace registers

Merge tag 'drm-fixes-2021-04-02' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"Things have settled down in time for Easter, a random smattering of
  small fixes across a few drivers.

  I'm guessing though there might be some i915 and misc fixes out there
  I haven't gotten yet, but since today is a public holiday here, I'm
  sending this early so I can have the day off, I'll see if more
  requests come in and decide what to do with them later.

  amdgpu:
   - Polaris idle power fix
   - VM fix
   - Vangogh S3 fix
   - Fixes for non-4K page sizes

  amdkfd:
   - dqm fence memory corruption fix

  tegra:
   - lockdep warning fix
   - runtime PM reference fix
   - display controller fix
   - PLL Fix

  imx:
   - memory leak in error path fix
   - LDB driver channel registration fix
   - oob array warning in LDB driver

  exynos:
   - unused header file removal"

* tag 'drm-fixes-2021-04-02' of git://anongit.freedesktop.org/drm/drm:
  drm/amdgpu: check alignment on CPU page for bo map
  drm/amdgpu: Set a suitable dev_info.gart_page_size
  drm/amdgpu/vangogh: don't check for dpm in is_dpm_running when in suspend
  drm/amdkfd: dqm fence memory corruption
  drm/tegra: sor: Grab runtime PM reference across reset
  drm/tegra: dc: Restore coupling of display controllers
  gpu: host1x: Use different lock classes for each client
  drm/tegra: dc: Don't set PLL clock to 0Hz
  drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()
  drm/amd/pm: no need to force MCLK to highest when no display connected
  drm/exynos/decon5433: Remove the unused include statements
  drm/imx: imx-ldb: fix out of bounds array access warning
  drm/imx: imx-ldb: Register LDB channel1 when it is the only channel to be used
  drm/imx: fix memory leak when fails to init

Merge tag 'imx-drm-fixes-2021-04-01' of git://git.pengutronix.de/git/pza/linux into drm-fixes

drm/imx: imx-drm-core and imx-ldb fixes

Fix a memory leak in an error path during DRM device initialization,
fix the LDB driver to register channel 1 even if channel 0 is unused,
and fix an out of bounds array access warning in the LDB driver.

Signed-off-by: Dave Airlie <[email protected]>
From: Philipp Zabel <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

Merge tag 'drm/tegra/for-5.12-rc6' of ssh://git.freedesktop.org/git/tegra/linux into drm-fixes

drm/tegra: Fixes for v5.12-rc6

This contains a couple of fixes for various issues such as lockdep
warnings, runtime PM references, coupled display controllers and
misconfigured PLLs.

Signed-off-by: Dave Airlie <[email protected]>
From: Thierry Reding <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

tracing: Fix stack trace event size

Commit cbc3b92ce037 fixed an issue to modify the macros of the stack trace
event so that user space could parse it properly. Originally the stack
trace format to user space showed that the called stack was a dynamic
array. But it is not actually a dynamic array, in the way that other
dynamic event arrays worked, and this broke user space parsing for it. The
update was to make the array look to have 8 entries in it. Helper
functions were added to make it parse it correctly, as the stack was
dynamic, but was determined by the size of the event stored.

Although this fixed user space on how it read the event, it changed the
internal structure used for the stack trace event. It changed the array
size from [0] to [8] (added 8 entries). This increased the size of the
stack trace event by 8 words. The size reserved on the ring buffer was the
size of the stack trace event plus the number of stack entries found in
the stack trace. That commit caused the amount to be 8 more than what was
needed because it did not expect the caller field to have any size. This
produced 8 entries of garbage (and reading random data) from the stack
trace event:

<idle>-0 [002] d... 1976396.837549: <stack trace>
=> trace_event_raw_event_sched_switch
=> __traceiter_sched_switch
=> __schedule
=> schedule_idle
=> do_idle
=> cpu_startup_entry
=> secondary_startup_64_no_verify
=> 0xc8c5e150ffff93de
=> 0xffff93de
=> 0
=> 0
=> 0xc8c5e17800000000
=> 0x1f30affff93de
=> 0x00000004
=> 0x200000000

Instead, subtract the size of the caller field from the size of the event
to make sure that only the amount needed to store the stack trace is
reserved.
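
The size calculation can be illustrated with a small sketch. The struct
below is a hypothetical stand-in for the stack trace event (its field
names and the CALLER_SLOTS constant are assumptions for illustration);
only the arithmetic follows the description above.

```c
#include <stddef.h>

/* Number of entries the caller array is declared with, per the text. */
#define CALLER_SLOTS 8

/* Hypothetical stand-in for the stack trace event structure. */
struct stack_entry_sketch {
    unsigned int  header;               /* placeholder for the common event header */
    int           size;                 /* number of valid caller entries */
    unsigned long caller[CALLER_SLOTS]; /* declared as [8] so user space can parse it */
};

/* Buggy: sizeof(*entry) already includes 8 caller slots. */
static size_t reserve_size_buggy(size_t nr_entries)
{
    return sizeof(struct stack_entry_sketch)
           + nr_entries * sizeof(unsigned long);
}

/* Fixed: subtract the caller field, then add only the entries actually stored. */
static size_t reserve_size_fixed(size_t nr_entries)
{
    return sizeof(struct stack_entry_sketch)
           - sizeof(((struct stack_entry_sketch *)0)->caller)
           + nr_entries * sizeof(unsigned long);
}
```

For any nr_entries the two results differ by exactly
CALLER_SLOTS * sizeof(unsigned long), which corresponds to the 8 garbage
entries in the trace above.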

Link: https://lore.kernel.org/lkml/[email protected]/
Cc: [email protected]
Fixes: cbc3b92ce037 ("tracing: Set kernel_stack's caller size properly")
Reported-by: Vasily Gorbik <[email protected]>
Tested-by: Vasily Gorbik <[email protected]>
Acked-by: Vasily Gorbik <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>

Merge tag 'sound-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Things seem to be calming down, only the usual device-specific fixes for
  HD-audio and USB-audio at this time"

* tag 'sound-5.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek: fix mute/micmute LEDs for HP 640 G8
  ALSA: hda: Add missing sanity checks in PM prepare/complete callbacks
  ALSA: hda: Re-add dropped snd_poewr_change_state() calls
  ALSA: usb-audio: Apply sample rate quirk to Logitech Connect
  ALSA: hda/realtek: call alc_update_headset_mode() in hp_automute_hook
  ALSA: hda/realtek: fix a determine_headset_type issue for a Dell AIO

Merge tag 'tomoyo-pr-20210401' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1

Pull tomoyo fix from Tetsuo Handa:
"An update on 'tomoyo: recognize kernel threads correctly' from Jens
Axboe to not special case PF_IO_WORKER for PF_KTHREAD"

* tag 'tomoyo-pr-20210401' of git://git.osdn.net/gitroot/tomoyo/tomoyo-test1:
tomoyo: don't special case PF_IO_WORKER for PF_KTHREAD

Merge tag 'xarray-5.12' of git://git.infradead.org/users/willy/xarray

Pull XArray fixes from Matthew Wilcox:
"My apologies for the lateness of this. I had a bug reported in the
  test suite, and when I started working on it, I realised I had two
  fixes sitting in the xarray tree since last November. Anyway,
  everything here is fixes, apart from adding xa_limit_16b. The test
  suite passes.

  Summary:

   - Fix a bug when splitting to a non-zero order

   - Documentation fix

   - Add a predefined 16-bit allocation limit

   - Various test suite fixes"

* tag 'xarray-5.12' of git://git.infradead.org/users/willy/xarray:
  idr test suite: Improve reporting from idr_find_test_1
  idr test suite: Create anchor before launching throbber
  idr test suite: Take RCU read lock in idr_find_test_1
  radix tree test suite: Register the main thread with the RCU library
  radix tree test suite: Fix compilation
  XArray: Add xa_limit_16b
  XArray: Fix splitting to non-zero orders
  XArray: Fix split documentation

io_uring: fix EIOCBQUEUED iter revert

iov_iter_revert() is done in completion handlers that happen before
read/write returns -EIOCBQUEUED, so there is no need to repeat the revert
afterwards. Moreover, even though it may appear to be just a no-op, it
actually races with 1) the user forging a new iovec of a different size and
2) reissue, which is done via io-wq and continues completely asynchronously.

Fixes: 3e6a0d3c7571c ("io_uring: fix -EAGAIN retry with IOPOLL")
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

io_uring/io-wq: protect against sprintf overflow

task_pid may be large enough to not fit into the remaining space of
TASK_COMM_LEN-sized buffers and overflow in sprintf. We do not care much
about uniqueness, so replace it with the safer snprintf().
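
A user-space sketch of the bounded formatting follows. The buffer size of
16 is assumed to match TASK_COMM_LEN, and the "iou-wrk-%d" format string
and the format_worker_name() helper are illustrative assumptions rather
than a copy of the patch.

```c
#include <stdio.h>
#include <string.h>

/* Assumed to match the kernel's TASK_COMM_LEN. */
#define COMM_LEN 16

/*
 * Format a worker name into a fixed-size buffer.
 * snprintf() never writes more than COMM_LEN bytes, including the
 * terminating NUL, so a large pid truncates the name instead of
 * overflowing. Returns 1 if the name was truncated, 0 otherwise.
 */
static int format_worker_name(char buf[COMM_LEN], int pid)
{
    int n = snprintf(buf, COMM_LEN, "iou-wrk-%d", pid);

    return n >= COMM_LEN;
}
```

With a nine-digit pid the full name would need 17 characters, so the
result is cut to 15 characters plus the terminator. Truncation can make
two names collide, which is acceptable when uniqueness is not required.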

Reported-by: Alexey Dobriyan <[email protected]>
Signed-off-by: Pavel Begunkov <[email protected]>
Link: https://lore.kernel.org/r/1702c6145d7e1c46fbc382f28334c02e1a3d3994.1617267273.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <[email protected]>

io_uring: don't mark S_ISBLK async work as unbounded

S_ISBLK is marked as unbounded work for async preparation, because it
doesn't match S_ISREG. That is incorrect, as any read/write to a block
device is also a bounded operation. Fix it up and ensure that S_ISBLK
isn't marked unbounded.

Signed-off-by: Jens Axboe <[email protected]>

null_blk: fix command timeout completion handling

Memory backed or zoned null block devices may generate actual request
timeout errors due to the submission path being blocked on memory
allocation or zone locking. Unlike fake timeouts or injected timeouts,
the request submission path will call blk_mq_complete_request() or
blk_mq_end_request() for these real timeout errors, causing a double
completion and use after free situation as the block layer timeout
handler executes blk_mq_rq_timed_out() and __blk_mq_free_request() in
blk_mq_check_expired(). This problem often triggers a NULL pointer
dereference such as:

BUG: kernel NULL pointer dereference, address: 0000000000000050
RIP: 0010:blk_mq_sched_mark_restart_hctx+0x5/0x20
...
Call Trace:
  dd_finish_request+0x56/0x80
  blk_mq_free_request+0x37/0x130
  null_handle_cmd+0xbf/0x250 [null_blk]
  ? null_queue_rq+0x67/0xd0 [null_blk]
  blk_mq_dispatch_rq_list+0x122/0x850
  __blk_mq_do_dispatch_sched+0xbb/0x2c0
  __blk_mq_sched_dispatch_requests+0x13d/0x190
  blk_mq_sched_dispatch_requests+0x30/0x60
  __blk_mq_run_hw_queue+0x49/0x90
  process_one_work+0x26c/0x580
  worker_thread+0x55/0x3c0
  ? process_one_work+0x580/0x580
  kthread+0x134/0x150
  ? kthread_create_worker_on_cpu+0x70/0x70
  ret_from_fork+0x1f/0x30

This problem very often triggers when running the full btrfs xfstests
on a memory-backed zoned null block device in a VM with a limited amount
of memory.

Avoid this by executing blk_mq_complete_request() in null_timeout_rq()
only for commands that are marked for a fake timeout completion using
the fake_timeout boolean in struct null_cmd. For timeout errors injected
through debugfs, the timeout handler will execute
blk_mq_complete_request() as before. This is safe as the submission
path does not complete requests in this case.

In null_timeout_rq(), also make sure to set the command error field to
BLK_STS_TIMEOUT and to propagate this error through to the request
completion.

Reported-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>
Tested-by: Johannes Thumshirn <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>

idr test suite: Improve reporting from idr_find_test_1

Instead of just reporting an assertion failure, report enough information
that we can start diagnosing exactly what went wrong.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

idr test suite: Create anchor before launching throbber

The throbber could race with creation of the anchor entry and cause the
IDR to have zero entries in it, which would cause the test to fail.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

idr test suite: Take RCU read lock in idr_find_test_1

When run on a single CPU, this test would frequently access already-freed
memory. Due to timing, this bug never showed up on multi-CPU tests.

Reported-by: Chris von Recklinghausen <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

radix tree test suite: Register the main thread with the RCU library

Several test runners register individual worker threads with the
RCU library, but neglect to register the main thread, which can lead
to objects being freed while the main thread is in what appears to be
an RCU critical section.

Reported-by: Chris von Recklinghausen <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

ACPI: processor: Fix CPU0 wakeup in acpi_idle_play_dead()

Commit 496121c02127 ("ACPI: processor: idle: Allow probing on platforms
with one ACPI C-state") broke CPU0 hotplug on certain systems, e.g.
I'm observing the following on AWS Nitro (e.g. r5b.xlarge but other
instance types are affected as well):

# echo 0 > /sys/devices/system/cpu/cpu0/online
# echo 1 > /sys/devices/system/cpu/cpu0/online
<10 seconds delay>
-bash: echo: write error: Input/output error

In fact, the above mentioned commit only revealed the problem and did
not introduce it. On x86, an NMI is used to wake up the CPU, and the
hlt_play_dead()/mwait_play_dead() loops are prepared to handle it:

	/*
	 * If NMI wants to wake up CPU0, start CPU0.
	 */
	if (wakeup_cpu0())
		start_cpu0();

cpuidle_play_dead() -> acpi_idle_play_dead() (which is now being called on
systems where it wasn't called before the above mentioned commit) serves
the same purpose but it doesn't have a path for CPU0. What happens now on
wakeup is:
- NMI is sent to CPU0
- wakeup_cpu0_nmi() works as expected
- we get back to while (1) loop in acpi_idle_play_dead()
- safe_halt() puts CPU0 to sleep again.

The straightforward/minimal fix is to add special handling for CPU0 on x86,
and that's what the patch does.

Fixes: 496121c02127 ("ACPI: processor: idle: Allow probing on platforms with one ACPI C-state")
Signed-off-by: Vitaly Kuznetsov <[email protected]>
Cc: 5.10+ <[email protected]> # 5.10+
Signed-off-by: Rafael J. Wysocki <[email protected]>

selftests: kvm: Check that TSC page value is small after KVM_SET_CLOCK(0)

Add a test for the issue where a KVM_SET_CLOCK(0) call could cause the
TSC page value to become very large because of a signedness issue around
hv_clock->system_time.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20210326155551 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Prevent 'hv_clock->system_time' from going negative in kvm_guest_time_update()

When guest time is reset with KVM_SET_CLOCK(0), it is possible for
'hv_clock->system_time' to become a small negative number. This happens
because in KVM_SET_CLOCK handling we set 'kvm->arch.kvmclock_offset' based
on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
kvm_guest_time_update() does (masterclock in use case):

hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;

'master_kernel_ns' represents the last time the masterclock was updated,
which can precede the KVM_SET_CLOCK() call. Normally this is not a
problem, as the difference is very small, e.g. I'm observing
hv_clock.system_time = -70 ns. The issue comes from the fact that
'hv_clock.system_time' is stored as unsigned, so 'system_time / 100' in
compute_tsc_page_parameters() becomes a very big number.

Use 'master_kernel_ns' instead of get_kvmclock_ns() when masterclock is in
use and get_kvmclock_base_ns() when it's not to prevent 'system_time' from
going negative.
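
The arithmetic behind the problem can be reproduced in user space. The
sketch below is an illustration only: both helpers are hypothetical
stand-ins, not the KVM functions, and the input values are chosen just to
reproduce the -70 ns case mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for storing the sum in an unsigned 64-bit field:
 * a small negative result wraps around to a value close to 2^64.
 */
static uint64_t system_time_unsigned(int64_t master_kernel_ns,
				     int64_t kvmclock_offset)
{
	return (uint64_t)(master_kernel_ns + kvmclock_offset);
}

/* Hypothetical stand-in for the conversion from ns to 100 ns units. */
static uint64_t to_100ns_units(uint64_t system_time_ns)
{
	return system_time_ns / 100;
}

/* Self-check: -70 ns stored as unsigned turns into a huge quotient. */
static void check_wraparound(void)
{
	uint64_t t = system_time_unsigned(1000, -1070);

	assert(t == UINT64_MAX - 69);
	assert(to_100ns_units(t) == UINT64_C(184467440737095515));
}
```

A non-negative sum behaves as expected: 70 ns divides down to 0 units,
while the wrapped -70 ns yields a quotient of roughly 1.8 * 10^17.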

Signed-off-by: Vitaly Kuznetsov <[email protected]>
Message-Id: <20210331124130 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken

pvclock_gtod_sync_lock can be taken with interrupts disabled if the
preempt notifier calls get_kvmclock_ns to update the Xen
runstate information:

   spin_lock include/linux/spinlock.h:354 [inline]
   get_kvmclock_ns+0x25/0x390 arch/x86/kvm/x86.c:2587
   kvm_xen_update_runstate+0x3d/0x2c0 arch/x86/kvm/xen.c:69
   kvm_xen_update_runstate_guest+0x74/0x320 arch/x86/kvm/xen.c:100
   kvm_xen_runstate_set_preempted arch/x86/kvm/xen.h:96 [inline]
   kvm_arch_vcpu_put+0x2d8/0x5a0 arch/x86/kvm/x86.c:4062

So change the users of the spinlock to spin_lock_irqsave and
spin_unlock_irqrestore.

Reported-by: [email protected]
Fixes: 30b5c851af79 ("KVM: x86/xen: Add support for vCPU runstate information")
Cc: David Woodhouse <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: reduce pvclock_gtod_sync_lock critical sections

There is no need to include changes to vcpu->requests into
the pvclock_gtod_sync_lock critical section. The changes to
the shared data structures (in pvclock_update_vm_gtod_copy)
already occur under the lock.

Cc: David Woodhouse <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

Merge branch 'kvm-fix-svm-races' into kvm-master