Git Repo - linux.git/log

IB/umad: Fix kernel crash while unloading ib_umad

When disassociating a device from umad we must ensure that the sysfs
access is prevented before blocking the fops, otherwise assumptions in
syfs don't hold:

    CPU0                     CPU1
ib_umad_kill_port()        ibdev_show()
    port->ib_dev = NULL
                                      dev_name(port->ib_dev)

The prior patch made an error in moving the device_destroy(), it should
have been split into device_del() (above) and put_device() (below). At
this point we already have the split, so move the device_del() back to its
original place.

  kernel stack
  PF: error_code(0x0000) - not-present page
  Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
  RIP: 0010:ibdev_show+0x18/0x50 [ib_umad]
  RSP: 0018:ffffc9000097fe40 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: ffffffffa0441120 RCX: ffff8881df514000
  RDX: ffff8881df514000 RSI: ffffffffa0441120 RDI: ffff8881df1e8870
  RBP: ffffffff81caf000 R08: ffff8881df1e8870 R09: 0000000000000000
  R10: 0000000000001000 R11: 0000000000000003 R12: ffff88822f550b40
  R13: 0000000000000001 R14: ffffc9000097ff08 R15: ffff8882238bad58
  FS:  00007f1437ff3740(0000) GS:ffff888236940000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000004e8 CR3: 00000001e0dfc001 CR4: 00000000001606e0
  Call Trace:
   dev_attr_show+0x15/0x50
   sysfs_kf_seq_show+0xb8/0x1a0
   seq_read+0x12d/0x350
   vfs_read+0x89/0x140
   ksys_read+0x55/0xd0
   do_syscall_64+0x55/0x1b0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9:

Fixes: cf7ad3030271 ("IB/umad: Avoid destroying device while it is accessed")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Yonatan Cohen <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

RDMA/mlx5: Fix async events cleanup flows

As in the prior patch, the devx code is not fully cleaning up its
event_lists before finishing driver_destroy allowing a later read to
trigger user after free conditions.

Re-arrange things so that the event_list is always empty after destroy and
ensure it remains empty until the file is closed.

Fixes: f7c8416ccea5 ("RDMA/core: Simplify destruction of FD uobjects")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Yishai Hadas <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

RDMA/core: Add missing list deletion on freeing event queue

When the uobject file scheme was revised to allow device disassociation
from the file it became possible for read() to still happen the driver
destroys the uobject.

The old clode code was not tolerant to concurrent read, and when it was
moved to the driver destroy it creates a bug.

Ensure the event_list is empty after driver destroy by adding the missing
list_del(). Otherwise read() can trigger a use after free and double
kfree.

Fixes: f7c8416ccea5 ("RDMA/core: Simplify destruction of FD uobjects")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael Guralnik <[email protected]>
Reviewed-by: Yishai Hadas <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

EDAC/sysfs: Remove csrow objects on errors

All created csrow objects must be removed in the error path of
edac_create_csrow_objects(). The objects have been added as devices.

They need to be removed by doing a device_del() *and* put_device() call
to also free their memory. The missing put_device() leaves a memory
leak. Use device_unregister() instead of device_del() which properly
unregisters the device doing both.

Fixes: 7adc05d2dc3a ("EDAC/sysfs: Drop device references properly")
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Tested-by: John Garry <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

EDAC/mc: Fix use-after-free and memleaks during device removal

A test kernel with the options DEBUG_TEST_DRIVER_REMOVE, KASAN and
DEBUG_KMEMLEAK set, revealed several issues when removing an mci device:

1) Use-after-free:

On 27.11.19 17:07:33, John Garry wrote:
> [   22.104498] BUG: KASAN: use-after-free in
> edac_remove_sysfs_mci_device+0x148/0x180

The use-after-free is caused by the mci_for_each_dimm() macro called in
edac_remove_sysfs_mci_device(). The iterator was introduced with

  c498afaf7df8 ("EDAC: Introduce an mci_for_each_dimm() iterator").

The iterator loop calls device_unregister(&dimm->dev), which removes
the sysfs entry of the device, but also frees the dimm struct in
dimm_attr_release(). When incrementing the loop in mci_for_each_dimm(),
the dimm struct is accessed again, after having been freed already.

The fix is to free all the mci device's subsequent dimm and csrow
objects at a later point, in _edac_mc_free(), when the mci device itself
is being freed.

This keeps the data structures intact and the mci device can be
fully used until its removal. The change allows the safe usage of
mci_for_each_dimm() to release dimm devices from sysfs.

2) Memory leaks:

Following memory leaks have been detected:

# grep edac /sys/kernel/debug/kmemleak | sort | uniq -c
       1     [<000000003c0f58f9>] edac_mc_alloc+0x3bc/0x9d0      # mci->csrows
      16     [<00000000bb932dc0>] edac_mc_alloc+0x49c/0x9d0      # csr->channels
      16     [<00000000e2734dba>] edac_mc_alloc+0x518/0x9d0      # csr->channels[chn]
       1     [<00000000eb040168>] edac_mc_alloc+0x5c8/0x9d0      # mci->dimms
      34     [<00000000ef737c29>] ghes_edac_register+0x1c8/0x3f8 # see edac_mc_alloc()

All leaks are from memory allocated by edac_mc_alloc().

Note: The test above shows that edac_mc_alloc() was called here from
ghes_edac_register(), thus both functions show up in the stack trace
but the module causing the leaks is edac_mc. The comments with the data
structures involved were made manually by analyzing the objdump.

The data structures listed above and created by edac_mc_alloc() are
not properly removed during device removal, which is done in
edac_mc_free().

There are two paths implemented to remove the device depending on device
registration, _edac_mc_free() is called if the device is not registered
and edac_unregister_sysfs() otherwise.

The implemenations differ. For the sysfs case, the mci device removal
lacks the removal of subsequent data structures (csrows, channels,
dimms). This causes the memory leaks (see mci_attr_release()).

[ bp: Massage commit message. ]

Fixes: c498afaf7df8 ("EDAC: Introduce an mci_for_each_dimm() iterator")
Fixes: faa2ad09c01c ("edac_mc: edac_mc_free() cannot assume mem_ctl_info is registered in sysfs.")
Fixes: 7a623c039075 ("edac: rewrite the sysfs code to use struct device")
Reported-by: John Garry <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Tested-by: John Garry <[email protected]>
Cc: <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

ALSA: usb-audio: Add clock validity quirk for Denon MC7000/MCX8000

It should be safe to ignore clock validity check result if the following
conditions are met:
- only one single sample rate is supported;
- the terminal is directly connected to the clock source;
- the clock type is internal.

This is to deal with some Denon DJ controllers that always reports that
clock is invalid.

Tested-by: Tobias Oszlanyi <[email protected]>
Signed-off-by: Alexander Tsoy <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Iwai <[email protected]>

cifs: Fix mode output in debugging statements

A number of the debug statements output file or directory mode
in hex. Change these to print using octal.

Signed-off-by: Frank Sorenson <[email protected]>
Signed-off-by: Steve French <[email protected]>

io-wq: don't call kXalloc_node() with non-online node

Glauber reports a crash on init on a box he has:

RIP: 0010:__alloc_pages_nodemask+0x132/0x340
Code: 18 01 75 04 41 80 ce 80 89 e8 48 8b 54 24 08 8b 74 24 1c c1 e8 0c 48 8b 3c 24 83 e0 01 88 44 24 20 48 85 d2 0f 85 74 01 00 00 <3b> 77 08 0f 82 6b 01 00 00 48 89 7c 24 10 89 ea 48 8b 07 b9 00 02
RSP: 0018:ffffb8be4d0b7c28 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000e8e8
RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000002080
RBP: 0000000000012cc0 R08: 0000000000000000 R09: 0000000000000002
R10: 0000000000000dc0 R11: ffff995c60400100 R12: 0000000000000000
R13: 0000000000012cc0 R14: 0000000000000001 R15: ffff995c60db00f0
FS:  00007f4d115ca900(0000) GS:ffff995c60d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000002088 CR3: 00000017cca66002 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
  alloc_slab_page+0x46/0x320
  new_slab+0x9d/0x4e0
  ___slab_alloc+0x507/0x6a0
  ? io_wq_create+0xb4/0x2a0
  __slab_alloc+0x1c/0x30
  kmem_cache_alloc_node_trace+0xa6/0x260
  io_wq_create+0xb4/0x2a0
  io_uring_setup+0x97f/0xaa0
  ? io_remove_personalities+0x30/0x30
  ? io_poll_trigger_evfd+0x30/0x30
  do_syscall_64+0x5b/0x1c0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4d116cb1ed

which is due to the 'wqe' and 'worker' allocation being node affine.
But it isn't valid to call the node affine allocation if the node isn't
online.

Setup structures for even offline nodes, as usual, but skip them in
terms of thread setup to not waste resources. If the node isn't online,
just alloc memory with NUMA_NO_NODE.

Reported-by: Glauber Costa <[email protected]>
Tested-by: Glauber Costa <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

hwmon: (pmbus/xdpe12284) fix typo in compatible strings

Make sure that the driver compatible strings matches the binding by
removing the space between the manufacturer and model.

Fixes: aaafb7c8eb1c ("hwmon: (pmbus) Add support for Infineon Multi-phase xdpe122 family controllers")
Cc: Vadim Pasternak <[email protected]>
Signed-off-by: Johan Hovold <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Guenter Roeck <[email protected]>

linux/pipe_fs_i.h: fix kernel-doc warnings after @wait was split

Fix kernel-doc warnings in struct pipe_inode_info after @wait was
split into @rd_wait and @wr_wait.

include/linux/pipe_fs_i.h:66: warning: Function parameter or member 'rd_wait' not described in 'pipe_inode_info'
include/linux/pipe_fs_i.h:66: warning: Function parameter or member 'wr_wait' not described in 'pipe_inode_info'

Fixes: 0ddad21d3e99 ("pipe: use exclusive waits when reading or writing")
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

ice: Trivial fixes

This is a collection of trivial fixes including fixing whitespace, typos,
function headers, reverse Christmas tree, etc.

Signed-off-by: Tony Nguyen <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Use correct netif error function

Use the correct netif_msg_[tx,rx]_error() function to determine whether to
print the MDD event type.

Signed-off-by: Ben Shelton <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Cleanup ice_vsi_alloc_q_vectors

1. Remove local variable num_q_vectors and use vsi->num_q_vectors instead
2. Remove local variable pf and pass vsi->back to ice_pf_to_dev

Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Make print statements more compact

Formatting strings in print function calls (like dev_info, dev_err, etc.)
can exceed 80 columns without making checkpatch unhappy. So remove
newlines where applicable and make print statements more compact.

Signed-off-by: Anirudh Venkataramanan <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Use ice_pf_to_dev

Use ice_pf_to_dev(pf) instead of &pf->pdev->dev
Use ice_pf_to_dev(vsi->back) instead of &vsi->back->pdev->dev
When a pointer to the pf instance is available, use ice_pf_to_dev
instead of ice_hw_to_dev

Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Remove possible null dereference

Commit 1f45ebe0d8fb ("ice: add extra check for null Rx descriptor") moved
the call to ice_construct_skb() under a null check as Coverity reported a
possible use of null skb. However, the original call was not deleted, do so
now.

Fixes: 1f45ebe0d8fb ("ice: add extra check for null Rx descriptor")
Reported-by: Bruce Allan <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: update Unit Load Status bitmask to check after reset

After a reset the Unit Load Status bits in the GLNVM_ULD register to check
for completion should be 0x7FF before continuing. Update the mask to check
(minus the three reserved bits that are always set).

Signed-off-by: Bruce Allan <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: fix and consolidate logging of NVM/firmware version information

Logging the firmware/NVM information during driver load is redundant since
that information is also available via ethtool. Move the functionality
found in ice_nvm_version_str() directly into ice_get_drvinfo() and remove
calling the former and logging that info during driver probe. This also
gets rid of a bug in ice_nvm_version_str() where it returns a pointer to
a buffer which is free'ed when that function exits.

Signed-off-by: Bruce Allan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Modify link message logging

This patch modifies link message logging to include "Full Duplex" and
"Negotiated" for FEC, so as to distinguish it from "Requested" FEC.

Signed-off-by: Akeem G Abodunrin <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Remove CONFIG_PCI_IOV wrap in ice_set_pf_caps

Remove unnecessary CONFIG_PCI_IOV wrapping in ice_set_pf_caps. None
of the data structures accessed within the block are wrapped with
this flag. When CONFIG_PCI_IOV is undefined, pf->num_vfs_supported
will be 0 anyway.

Signed-off-by: Anirudh Venkataramanan <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Remove ice_dev_onetime_setup()

ice_dev_onetime_setup contains driver workarounds needed for
firmware limitations. These issues have now been resolved in newer
NVMs so remove the function.

Signed-off-by: Brett Creeley <[email protected]>
Signed-off-by: Tony Nguyen <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Don't allow same value for Rx tail to be written twice

Currently we compare the value we are about to write to the Rx tail
register with the previous value of next_to_use. The problem with this
is we only write tail on 8 descriptor boundaries, but next_to_use is
updated whenever we clean Rx descriptors. Fix this by comparing the
value we are about to write to tail with the previously written tail
value. This will prevent duplicate Rx tail bumps.

Signed-off-by: Brett Creeley <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: display supported and advertised link modes

Display all of the supported and advertised link modes based on the PHY
capability with media.

Displaying all supported modes is more informative then only displaying
the current link mode.

Signed-off-by: Paul Greenwalt <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Fix switch between FW and SW LLDP

When switching between FW and SW LLDP mode, the
number of configured TLV apps in the driver's
DCB configuration is getting out of synch with
what lldpad thinks is configured. This is causing
a problem when shutting down lldpad. The cleanup
is trying to delete TLV apps that are not defined
in the kernel.

Since the driver is keeping an accurate account
of the apps defined, use the drivers number of
apps to determine if there is an app to delete.
If the number of apps is <= 1, then do not
attempt to delete.

Signed-off-by: Dave Ertman <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

ice: Fix DCB rebuild after reset

The function ice_dcb_rebuild had some logic
flaws in it, and also didn't differentiate
between FW and SW modes needs.

For FW flow, the willing setting was being
forced to OFF and left that way. Unwilling
in DCB FW mode is not a supported model.

Leave the config alone and use the return value
from the set command to determine if setting the
config was successful.

The SW DCB flow does not need to need to register
for MIB change events (as they are not used in
SW mode).

Use !is_sw_lldp checks to only perform FW specific
task while in FW mode.

Also adding a reapplication of the current DCB
config after a link event. Some NVMs are not
maintaining their DCB configs across link events.

Signed-off-by: Dave Ertman <[email protected]>
Tested-by: Andrew Bowers <[email protected]>
Signed-off-by: Jeff Kirsher <[email protected]>

docs: virt: guest-halt-polling.txt convert to ReST

Due to some merge conflict, this file ended being alone under
Documentation/virtual.

The file itself is almost at ReST format. Just minor
adjustments are needed:

- Adjust title markup;
- Adjust a list identation;
- add a literal block markup;
- Add some blank lines.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: review-checklist.txt: rename to ReST

This file is already in ReST compatible format.
So, rename it and add to the kvm's index.rst.

While here, use the standard conversion for document titles.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: Convert timekeeping.txt to ReST format

- Use document title and chapter markups;
- Add markups for literal blocks;
- Add markups for tables;
- use :field: for field descriptions;
- Add blank lines and adjust indentation.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: Convert s390-diag.txt to ReST format

This file is almost in ReST format. Just one change was
needed:

- Add markups for a literal block and change its indentation.

While here, use the standard markup for the document title.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: Convert ppc-pv.txt to ReST format

- Use document title and chapter markups;
- Add markups for tables;
- Use list markups;
- Add markups for literal blocks;
- Add blank lines.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: Convert nested-vmx.txt to ReST format

This file is almost in ReST format. Just a small set of
changes were needed:

    - Add markups for lists;
    - Add markups for a literal block;
    - Adjust whitespaces.

While here, use the standard markup for the document title.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: Convert mmu.txt to ReST format

- Use document title and chapter markups;
- Add markups for tables;
- Add markups for literal blocks;
- Add blank lines and adjust indentation.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: Convert locking.txt to ReST format

- Use document title and chapter markups;
- Add markups for literal blocks;
- use :field: for field descriptions;
- Add blank lines and adjust indentation.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: Convert hypercalls.txt to ReST format

- Use document title and chapter markups;
- Convert tables;
- Add markups for literal blocks;
- use :field: for field descriptions;
- Add blank lines and adjust indentation

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: arm/psci.txt: convert to ReST

- Add a title for the document;
- Adjust whitespaces for it to be properly formatted after
parsed.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert arm/hyp-abi.txt to ReST

- Add proper markups for titles;
- Adjust whitespaces and blank lines to match ReST
needs;
- Mark literal blocks as such.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: Convert api.txt to ReST format

convert api.txt document to ReST format while trying to keep
its format as close as possible with the authors intent, and
avoid adding uneeded markups.

- Use document title and chapter markups;
- Convert tables;
- Add markups for literal blocks;
- use :field: for field descriptions;
- Add blank lines and adjust indentation

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert devices/xive.txt to ReST

- Use title markups;
- adjust indentation and add blank lines as needed;
- adjust tables to match ReST accepted formats;
- mark code blocks as such.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert devices/xics.txt to ReST

- Use title markups;
- adjust indentation and add blank lines as needed;
- adjust tables to match ReST accepted formats;
- use :field: markups.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert devices/vm.txt to ReST

- Use title markups;
- adjust indentation and add blank lines as needed;
- use :field: markups;
- Use cross-references;
- mark code blocks as such.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert devices/vfio.txt to ReST

- Use standard title markup;
- adjust lists;
- mark code blocks as such.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert devices/vcpu.txt to ReST

- Use title markups;
- adjust indentation and add blank lines as needed;
- adjust tables to match ReST accepted formats;
- use :field: markups;
- mark code blocks as such.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert devices/s390_flic.txt to ReST

- Use standard markup for document title;
- Adjust indentation and add blank lines as needed;
- use the notes markup;
- mark code blocks as such.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert devices/mpic.txt to ReST

This document is almost in ReST format. The only thing
needed is to mark a list as such and to add an extra
whitespace.

Yet, let's also use the standard document title markup,
as it makes easier if anyone wants later to add sessions
to it.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: convert devices/arm-vgit.txt to ReST

- Use title markups;
- change indent to match ReST syntax;
- use proper table markups;
- use literal block markups.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: devices/arm-vgit-v3.txt to ReST

- Use title markups;
- change indent to match ReST syntax;
- use proper table markups;
- use literal block markups.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: devices/arm-vgic-its.txt to ReST format

- Fix document title to match ReST format
- Convert the table to be properly recognized
- use proper markups for literal blocks
- Some indentation fixes to match ReST

While here, add an index for kvm devices.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: virt: Convert msr.txt to ReST format

- Use document title markup;
- Convert tables;
- Add blank lines and adjust indentation.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: virt: convert halt-polling.txt to ReST format

- Fix document title to match ReST format
- Convert the table to be properly recognized
- Some indentation fixes to match ReST syntax.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: virt: user_mode_linux.rst: fix URL references

Several URLs are pointing to outdated places.

Update the references for the URLs whose contents still exists,
removing the others.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: virt: user_mode_linux.rst: update compiling instructions

Instead of pointing for a pre-2.4 and a seaparate patch,
update it to match current upstream, as UML was merged
a long time ago.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: virt: convert UML documentation to ReST

Despite being an old document, it contains lots of information
that could still be useful.

The document has a nice style with makes easy to convert to
ReST. So, let's convert it to ReST.

This patch does:

- Use proper markups for titles;
- Mark and proper indent literal blocks;
- don't use an 'o' character for lists;
- other minor changes required for the doc to be parsed.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

docs: kvm: add arm/pvtime.rst to index.rst

Add this file to a new kvm/arm index.rst, in order for it to
be shown as part of the virt book.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: fix WARN_ON check of an unsigned less than zero

The check cpu->hv_clock.system_time < 0 is redundant since system_time
is a u64 and hence can never be less than zero. But what was actually
meant is to check that the result is positive, since kernel_ns and
v->kvm->arch.kvmclock_offset are both s64.

Reported-by: Colin King <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Addresses-Coverity: ("Macro compares unsigned to 0")
Reviewed-by: Miaohe Lin <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: KVM: Remove unused x86_register enum

x86_register enum is not used, let's remove it.

Signed-off-by: Eric Auger <[email protected]>
Suggested-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Fix struct guest_walker arrays for 5-level paging

Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow
paging for 5-level guest page tables. PT_MAX_FULL_LEVELS is used to
size the arrays that track guest pages table information, i.e. using a
"max levels" of 4 causes KVM to access garbage beyond the end of an
array when querying state for level 5 entries. E.g. FNAME(gpte_changed)
will read garbage and most likely return %true for a level 5 entry,
soft-hanging the guest because FNAME(fetch) will restart the guest
instead of creating SPTEs because it thinks the guest PTE has changed.

Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS
gets to stay "4" for the PTTYPE_EPT case.

Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Use correct root level for nested EPT shadow page tables

Hardcode the EPT page-walk level for L2 to be 4 levels, as KVM's MMU
currently also hardcodes the page walk level for nested EPT to be 4
levels. The L2 guest is all but guaranteed to soft hang on its first
instruction when L1 is using EPT, as KVM will construct 4-level page
tables and then tell hardware to use 5-level page tables.

Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
Cc: [email protected]
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Fix some comment typos and coding style

Fix some typos in the comments. Also fix coding style.
[Sean Christopherson rewrites the comment of write_fault_to_shadow_pgtable
field in struct kvm_vcpu_arch.]

Signed-off-by: Miaohe Lin <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86/mmu: Avoid retpoline on ->page_fault() with TDP

Wrap calls to ->page_fault() with a small shim to directly invoke the
TDP fault handler when the kernel is using retpolines and TDP is being
used. Single out the TDP fault handler and annotate the TDP path as
likely to coerce the compiler into preferring it over the indirect
function call.

Rename tdp_page_fault() to kvm_tdp_page_fault(), as it's exposed outside
of mmu.c to allow inlining the shim.

Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: apic: reuse smp_wmb() in kvm_make_request()

kvm_make_request() provides smp_wmb() so pending_events changes are
guaranteed to be visible.

Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: remove duplicated KVM_REQ_EVENT request

The KVM_REQ_EVENT request is already made in kvm_set_rflags(). We should
not make it again.

Signed-off-by: Miaohe Lin <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: KVM: SVM: Add vmcall test

L2 guest calls vmcall and L1 checks the exit status does
correspond.

Signed-off-by: Eric Auger <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Tested-by: Wei Huang <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: KVM: AMD Nested test infrastructure

Add the basic infrastructure needed to test AMD nested SVM.
This is largely copied from the KVM unit test infrastructure.

Signed-off-by: Eric Auger <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

selftests: KVM: Replace get_{gdt,idt}_base() by get_{gdt,idt}()

get_gdt_base() and get_idt_base() only return the base address
of the descriptor tables. Soon we will need to get the size as well.
Change the prototype of those functions so that they return
the whole desc_ptr struct instead of the address field.

Signed-off-by: Eric Auger <[email protected]>
Reviewed-by: Vitaly Kuznetsov <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Reviewed-by: Wei Huang <[email protected]>
Reviewed-by: Krish Sadhukhan <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

NFSv4: Fix revalidation of dentries with delegations

If a dentry was not initially looked up while we were holding a
delegation, then we do still need to revalidate that it still holds
the same name. If there are multiple hard links to the same file,
then all the hard links need validation.

Reported-by: Benjamin Coddington <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Reviewed-by: Benjamin Coddington <[email protected]>
Tested-by: Benjamin Coddington <[email protected]>
[Anna: Put nfs_unset_verifier_delegated() under CONFIG_NFS_V4]
Signed-off-by: Anna Schumaker <[email protected]>

Merge tag 'kbuild-fixes-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- fix memory corruption in scripts/kallsyms

- fix the vmlinux link stage to correctly update compile.h

* tag 'kbuild-fixes-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: fix mismatch between .version and include/generated/compile.h
scripts/kallsyms: fix memory corruption caused by write over-run

net: ethernet: ave: Add capability of rgmii-id mode

This allows you to specify the type of rgmii-id that will enable phy
internal delay in ethernet phy-mode.

This adds all RGMII cases to all of get_pinmode() except LD11, because LD11
SoC doesn't support RGMII due to the constraint of the hardware. When RGMII
phy mode is specified in the devicetree for LD11, the driver will abort
with an error.

Signed-off-by: Kunihiko Hayashi <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

enic: prevent waking up stopped tx queues over watchdog reset

Recent months, our customer reported several kernel crashes all
preceding with following message:
NETDEV WATCHDOG: eth2 (enic): transmit queue 0 timed out
Error message of one of those crashes:
BUG: unable to handle kernel paging request at ffffffffa007e090

After analyzing severl vmcores, I found that most of crashes are
caused by memory corruption. And all the corrupted memory areas
are overwritten by data of network packets. Moreover, I also found
that the tx queues were enabled over watchdog reset.

After going through the source code, I found that in enic_stop(),
the tx queues stopped by netif_tx_disable() could be woken up over
a small time window between netif_tx_disable() and the
napi_disable() by the following code path:
napi_poll->
  enic_poll_msix_wq->
     vnic_cq_service->
        enic_wq_service->
           netif_wake_subqueue(enic->netdev, q_number)->
              test_and_clear_bit(__QUEUE_STATE_DRV_XOFF, &txq->state)
In turn, upper netowrk stack could queue skb to ENIC NIC though
enic_hard_start_xmit(). And this might introduce some race condition.

Our customer comfirmed that this kind of kernel crash doesn't occur over
90 days since they applied this patch.

Signed-off-by: Firo Yang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

btrfs: sysfs, move device id directories to UUID/devinfo

Originally it was planned to create device id directories under
UUID/devinfo, but it got under UUID/devices by mistake. We really want
it under definfo so the bare device node names are not mixed with device
ids and are easy to enumerate.

Fixes: 668e48af7a94 ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
Signed-off-by: Anand Jain <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: sysfs, add UUID/devinfo kobject

Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).

Signed-off-by: Anand Jain <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

arm64: time: Replace <linux/clk-provider.h> by <linux/of_clk.h>

The arm64 time code is not a clock provider, and just needs to call
of_clk_init().

Hence it can include <linux/of_clk.h> instead of <linux/clk-provider.h>.

Reviewed-by: Stephen Boyd <[email protected]>
Signed-off-by: Geert Uytterhoeven <[email protected]>
Signed-off-by: Will Deacon <[email protected]>

Btrfs: fix race between shrinking truncate and fiemap

When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).

Consider the following example:

1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
   and a 1MiB prealloc extent at file offset 128KiB;

2) Task A starts doing a shrinking truncate of our inode to reduce it to
   a size of 64KiB. Before it searches the subvolume tree for file
   extent items to delete, it drops all the extent maps in the range
   from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();

3) Task B starts doing a fiemap against our inode. When looking up for
   the inode's extent maps in the range from 128KiB to (u64)-1, it
   doesn't find any in the inode's extent map tree, since they were
   removed by task A.  Because it didn't find any in the extent map
   tree, it scans the inode's subvolume tree for file extent items, and
   it finds the 1MiB prealloc extent at file offset 128KiB, then it
   creates an extent map based on that file extent item and adds it to
   inode's extent map tree (this ends up being done by
   btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
   get_extent_skip_holes());

4) Task A then drops the prealloc extent at file offset 128KiB and
   shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
   truncation operation finishes and we end up with an extent map
   representing a 1MiB prealloc extent at file offset 128KiB, despite we
   don't have any more that extent;

After this the two types of problems we have are:

1) Future calls to fiemap always report that a 1MiB prealloc extent
   exists at file offset 128KiB. This is stale information, no longer
   correct;

2) If the size of the file is increased, by a truncate operation that
   increases the file size or by a write into a file offset > 64KiB for
   example, we end up not inserting file extent items to represent holes
   for any range between 128KiB and 128KiB + 1MiB, since the hole
   expansion function, btrfs_cont_expand() will skip hole insertion for
   any range for which an extent map exists that represents a prealloc
   extent. This causes fsck to complain about missing file extent items
   when not using the NO_HOLES feature.

The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.

Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.

CC: [email protected] # 4.4+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: log message when rw remount is attempted with unclean tree-log

A remount to a read-write filesystem is not safe when there's tree-log
to be replayed. Files that could be opened until now might be affected
by the changes in the tree-log.

A regular mount is needed to replay the log so the filesystem presents
the consistent view with the pending changes included.

CC: [email protected] # 4.4+
Reviewed-by: Anand Jain <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: print message when tree-log replay starts

There's no logged information about tree-log replay although this is
something that points to previous unclean unmount. Other filesystems
report that as well.

Suggested-by: Chris Murphy <[email protected]>
CC: [email protected] # 4.4+
Reviewed-by: Anand Jain <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Btrfs: fix race between using extent maps and merging them

We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.

With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():

  $ cat -n fs/btrfs/extent_io.c
  3061  static int __do_readpage(struct extent_io_tree *tree,
  3062                           struct page *page,
  (...)
  3127                  em = __get_extent_map(inode, page, pg_offset, cur,
  3128                                        end - cur + 1, get_extent, em_cached);
  3129                  if (IS_ERR_OR_NULL(em)) {
  3130                          SetPageError(page);
  3131                          unlock_extent(tree, cur, end);
  3132                          break;
  3133                  }
  3134                  extent_offset = cur - em->start;
  3135                  BUG_ON(extent_map_end(em) <= cur);
  (...)

Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().

We have an inode with a size of 8KiB and 2 extent maps:

  extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
            a previous transaction

  extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
            persisted but writeback started for it already. The extent map
    is pinned since there's writeback and an ordered extent in
    progress, so it can not be merged with extent map A yet

The following sequence of steps leads to the BUG_ON():

1) The ordered extent for extent B completes, the respective page gets its
   writeback bit cleared and the extent map is unpinned, at that point it
   is not yet merged with extent map A because it's in the list of modified
   extents;

2) Due to memory pressure, or some other reason, the MM subsystem releases
   the page corresponding to extent B - btrfs_releasepage() is called and
   returns 1, meaning the page can be released as it's not dirty, not under
   writeback anymore and the extent range is not locked in the inode's
   iotree. However the extent map is not released, either because we are
   not in a context that allows memory allocations to block or because the
   inode's size is smaller than 16MiB - in this case our inode has a size
   of 8KiB;

3) Task B needs to read extent B and ends up __do_readpage() through the
   btrfs_readpage() callback. At __do_readpage() it gets a reference to
   extent map B;

4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
   while holding the write lock on the inode's extent map tree - this
   results in try_merge_map() being called and since it's possible to merge
   extent map B with extent map A now (the extent map B was removed from
   the list of modified extents), the merging begins - it sets extent map
   B's start offset to 0 (was 4KiB), but before it increments the map's
   length to 8KiB (4kb + 4KiB), task A is at:

   BUG_ON(extent_map_end(em) <= cur);

   The call to extent_map_end() sees the extent map has a start of 0
   and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
   the BUG_ON() is triggered.

So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.

Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).

Reported-by: ryusuke1925 <[email protected]>
Reported-by: Koki Mitani <[email protected]>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: [email protected] # 4.4+
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: ref-verify: fix memory leaks

In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
kmalloc(), respectively. In the following code, if an error occurs, the
execution will be redirected to 'out' or 'out_unlock' and the function will
be exited. However, on some of the paths, 'ref' and 'ra' are not
deallocated, leading to memory leaks. For example, if 'action' is
BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
value indicates an error, the execution will be redirected to 'out'. But,
'ref' is not deallocated on this path, causing a memory leak.

To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
the function when an error is encountered.

CC: [email protected] # 4.15+
Signed-off-by: Wenwen Wang <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>

tools headers kvm: Sync linux/kvm.h with the kernel sources

To pick up the changes from:

  7de3f1423ff9 ("KVM: s390: Add new reset vcpu API")

So far we're ignoring those arch specific ioctls, we need to revisit
this at some time to have arch specific tables, etc:

  $ grep S390 tools/perf/trace/beauty/kvm_ioctl.sh
   egrep -v " ((ARM|PPC|S390)_|[GS]ET_(DEBUGREGS|PIT2|XSAVE|TSC_KHZ)|CREATE_SPAPR_TCE_64)" | \
  $

This addresses these tools/perf build warnings:

  Warning: Kernel ABI header at 'tools/arch/arm/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm/include/uapi/asm/kvm.h'
  diff -u tools/arch/arm/include/uapi/asm/kvm.h arch/arm/include/uapi/asm/kvm.h

Cc: Adrian Hunter <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Janosch Frank <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

tools headers kvm: Sync kvm headers with the kernel sources

To pick up the changes from:

290a6bb06de9 ("arm64: KVM: Add UAPI notes for swapped registers")

No tools changes are caused by this.

This addresses these tools/perf build warnings:

Cc: Adrian Hunter <[email protected]>
Cc: Andrew Jones <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

tools arch x86: Sync asm/cpufeatures.h with the kernel sources

To pick up the changes from:

  85c17291e2eb ("x86/cpufeatures: Add flag to track whether MSR IA32_FEAT_CTL is configured")
  f444a5ff95dc ("x86/cpufeatures: Add support for fast short REP; MOVSB")

These don't cause any changes in tooling, just silences this perf build
warning:

  Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
  diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h

Cc: Adrian Hunter <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Tony Luck <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

tools headers x86: Sync disabled-features.h

To silence the following tools/perf build warning:

  Warning: Kernel ABI header at 'tools/arch/x86/include/asm/disabled-features.h' differs from latest version at 'arch/x86/include/asm/disabled-features.h'
  diff -u tools/arch/x86/include/asm/disabled-features.h arch/x86/include/asm/disabled-features.h

Picking up the changes in:

  45fc24e89b7c ("x86/mpx: remove MPX from arch/x86")

that didn't entail any functionality change in the tooling side.

Cc: Adrian Hunter <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

drm/i915: Mark the removal of the i915_request from the sched.link

Keep the rq->fence.flags consistent with the status of the
rq->sched.link, and clear the associated bits when decoupling the link
on retirement (as we may wish to inspect those flags independent of
other state).

Fixes: c3f1ed90e6ff ("drm/i915/gt: Allow temporary suspension of inflight requests")
References: https://gitlab.freedesktop.org/drm/intel/issues/997
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit b4a9a149f91ea345da76bcfe3f8a39715ac346a6)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/execlists: Reclaim the hanging virtual request

If we encounter a hang on a virtual engine, as we process the hang the
request may already have been moved back to the virtual engine (we are
processing the hang on the physical engine). We need to reclaim the
request from the virtual engine so that the locking is consistent and
local to the real engine on which we will hold the request for error
state capturing.

v2: Pull the reclamation into execlists_hold() and assert that cannot be
called from outside of the reset (i.e. with the tasklet disabled).
v3: Added selftest
v4: Drop the reference owned by the virtual engine

Fixes: ad18ba7b5eeb ("drm/i915/execlists: Offline error capture")
Testcase: igt/gem_exec_balancer/hang
Signed-off-by: Chris Wilson <[email protected]>
Cc: Mika Kuoppala <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 989df3a7bd2abe566521e61d1aebf603eb013b7f)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/execlists: Take a reference while capturing the guilty request

Thanks to preempt-to-busy, we leave the request on the HW as we submit
the preemption request. This means that the request may complete at any
moment as we process HW events, and in particular the request may be
retired as we are planning to capture it for a preemption timeout.

Be more careful while obtaining the request to capture after a
preemption timeout, and check to see if it completed before we were able
to put it on the on-hold list. If we do see it did complete just before
we capture the request, proclaim the preemption-timeout a false positive
and pardon the reset as we should hit an arbitration point momentarily
and so be able to process the preemption.

Note that even after we move the request to be on hold it may be retired
(as the reset to stop the HW comes after), so we do require to hold our
own reference as we work on the request for capture (and all of the
peeking at state within the request needs to be carefully protected).

Fixes: c3f1ed90e6ff ("drm/i915/gt: Allow temporary suspension of inflight requests")
Closes: https://gitlab.freedesktop.org/drm/intel/issues/997
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 4ba5c086a1d8e38d6927967ae1a3271a6ab7a927)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/execlists: Offline error capture

Currently, we skip error capture upon forced preemption. We apply forced
preemption when there is a higher priority request that should be
running but is being blocked, and we skip inline error capture so that
the preemption request is not further delayed by a user controlled
capture -- extending the denial of service.

However, preemption reset is also used for heartbeats and regular GPU
hangs. By skipping the error capture, we remove the ability to debug GPU
hangs.

In order to capture the error without delaying the preemption request
further, we can do an out-of-line capture by removing the guilty request
from the execution queue and scheduling a worker to dump that request.
When removing a request, we need to remove the entire context and all
descendants from the execution queue, so that they do not jump past.

Closes: https://gitlab.freedesktop.org/drm/intel/issues/738
Fixes: 3a7a92aba8fb ("drm/i915/execlists: Force preemption")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Mika Kuoppala <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 748317386afb235e11616098d2c7772e49776b58)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/gt: Allow temporary suspension of inflight requests

In order to support out-of-line error capture, we need to remove the
active request from HW and put it to one side while a worker compresses
and stores all the details associated with that request. (As that
compression may take an arbitrary user-controlled amount of time, we
want to let the engine continue running on other workloads while the
hanging request is dumped.) Not only do we need to remove the active
request, but we also have to remove its context and all requests that
were dependent on it (both in flight, queued and future submission).

Finally once the capture is complete, we need to be able to resubmit the
request and its dependents and allow them to execute.

v2: Replace stack recursion with a simple list.
v3: Check all the parents, not just the first, when searching for a
stuck ancestor!

References: https://gitlab.freedesktop.org/drm/intel/issues/738
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 32ff621fd74496f0c33644125fb69ff175859b1f)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Keep track of request among the scheduling lists

If we keep track of when the i915_request.sched.link is on the HW
runlist, or in the priority queue we can simplify our interactions with
the request (such as during rescheduling). This also simplifies the next
patch where we introduce a new in-between list, for requests that are
ready but neither on the run list or in the queue.

v2: Update i915_sched_node.link explanation for current usage where it
is a link on both the queue and on the runlists.

Signed-off-by: Chris Wilson <[email protected]>
Cc: Mika Kuoppala <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 672c368f9398042b629740cc9816e8e939eff2db)
Signed-off-by: Jani Nikula <[email protected]>

Merge tag 'gvt-fixes-2020-02-12' of https://github.com/intel/gvt-linux into drm-intel-next-fixes

gvt-fixes-2020-02-12

- fix possible high-order allocation fail for late load (Igor)
- fix one missed lock for ppgtt mm LRU list (Igor)

Signed-off-by: Jani Nikula <[email protected]>
From: Zhenyu Wang <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

tools include UAPI: Sync sound/asound.h copy

Picking the changes from:

  46b770f720bd ("ALSA: uapi: Fix sparse warning")
  a103a3989993 ("ALSA: control: Fix incompatible protocol error")
  bd3eb4e87eb3 ("ALSA: ctl: bump protocol version up to v2.1.0")
  ff16351e3f30 ("ALSA: ctl: remove dimen member from elem_info structure")
  542283566679 ("ALSA: ctl: remove unused macro for timestamping of elem_value")
  7fd7d6c50451 ("ALSA: uapi: Fix typos and header inclusion in asound.h")
  1cfaef961703 ("ALSA: bump uapi version numbers")
  80fe7430c708 ("ALSA: add new 32-bit layout for snd_pcm_mmap_status/control")
  07094ae6f952 ("ALSA: Avoid using timespec for struct snd_timer_tread")
  d9e5582c4bb2 ("ALSA: Avoid using timespec for struct snd_rawmidi_status")
  3ddee7f88aaf ("ALSA: Avoid using timespec for struct snd_pcm_status")
  a4e7dd35b9da ("ALSA: Avoid using timespec for struct snd_ctl_elem_value")
  a07804cc7472 ("ALSA: Avoid using timespec for struct snd_timer_status")

Which entails no changes in the tooling side.

To silence this perf tools build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/sound/asound.h' differs from latest version at 'include/uapi/sound/asound.h'
  diff -u tools/include/uapi/sound/asound.h include/uapi/sound/asound.h

Cc: Adrian Hunter <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Ranjani Sridharan <[email protected]>
Cc: Takashi Iwai <[email protected]>
Cc: Takashi Sakamoto <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

tools headers UAPI: Sync asm-generic/mman-common.h with the kernel

To pick the changes from:

  d41938d2cbee ("mm: Reserve asm-generic prot flags 0x10 and 0x20 for arch use")

No changes in tooling, just a rebuild as files needed got touched.

This addresses the following perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/mman-common.h' differs from latest version at 'include/uapi/asm-generic/mman-common.h'
  diff -u tools/include/uapi/asm-generic/mman-common.h include/uapi/asm-generic/mman-common.h

Cc: Adrian Hunter <[email protected]>
Cc: Dave Martin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

perf tools: Add arm64 version of get_cpuid()

Add an arm64 version of get_cpuid(), which is used for various annotation
and headers - for example, I now get the CPUID in "perf report --header",
as shown in this snippet:

  # hostname : ubuntu
  # os release : 5.5.0-rc1-dirty
  # perf version : 5.5.rc1.gbf8a13dc9851
  # arch : aarch64
  # nrcpus online : 96
  # nrcpus avail : 96
  # cpuid : 0x00000000480fd010

Since much of the code to read the MIDR is already in get_cpuid_str(),
factor out this code.

Tester notes:

I tested this patch on my new ARM64 Kunpeng 920 server.
[root@node1 zsk]# ./perf --version
perf version 5.6.rc1.g2cdb955b7252

Both perf list and perf stat can work.

Signed-off-by: John Garry <[email protected]>
Tested-by: Shaokun Zhang <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lore.kernel.org/lkml/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

tools headers UAPI: Sync drm/i915_drm.h with the kernel sources

To pick the change in:

  cc662126b413 ("drm/i915: Introduce DRM_I915_GEM_MMAP_OFFSET")

That don't result in any changes in tooling, just silences this perf
build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/drm/i915_drm.h' differs from latest version at 'include/uapi/drm/i915_drm.h'
  diff -u tools/include/uapi/drm/i915_drm.h include/uapi/drm/i915_drm.h

Cc: Abdiel Janulgue <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Chris Wilson <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

tools headers uapi: Sync linux/fscrypt.h with the kernel sources

To pick the changes from:

  e933adde6f97 ("fscrypt: include <linux/ioctl.h> in UAPI header")
  93edd392cad7 ("fscrypt: support passing a keyring key to FS_IOC_ADD_ENCRYPTION_KEY")

That don't trigger any changes in tooling.

This silences this perf build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/fscrypt.h' differs from latest version at 'include/uapi/linux/fscrypt.h'
  diff -u tools/include/uapi/linux/fscrypt.h include/uapi/linux/fscrypt.h

Cc: Adrian Hunter <[email protected]>
Cc: Eric Biggers <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Namhyung Kim <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

KVM: x86: Deliver exception payload on KVM_GET_VCPU_EVENTS

KVM allows the deferral of exception payloads when a vCPU is in guest
mode to allow the L1 hypervisor to intercept certain events (#PF, #DB)
before register state has been modified. However, this behavior is
incompatible with the KVM_{GET,SET}_VCPU_EVENTS ABI, as userspace
expects register state to have been immediately modified. Userspace may
opt-in for the payload deferral behavior with the
KVM_CAP_EXCEPTION_PAYLOAD per-VM capability. As such,
kvm_multiple_exception() will immediately manipulate guest registers if
the capability hasn't been requested.

Since the deferral is only necessary if a userspace ioctl were to be
serviced at the same as a payload bearing exception is recognized, this
behavior can be relaxed. Instead, opportunistically defer the payload
from kvm_multiple_exception() and deliver the payload before completing
a KVM_GET_VCPU_EVENTS ioctl.

Signed-off-by: Oliver Upton <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: nVMX: Handle pending #DB when injecting INIT VM-exit

SDM 27.3.4 states that the 'pending debug exceptions' VMCS field will
be populated if a VM-exit caused by an INIT signal takes priority over a
debug-trap. Emulate this behavior when synthesizing an INIT signal
VM-exit into L1.

Fixes: 4b9852f4f389 ("KVM: x86: Fix INIT signal handling in various CPU states")
Signed-off-by: Oliver Upton <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: x86: Mask off reserved bit from #DB exception payload

KVM defines the #DB payload as compatible with the 'pending debug
exceptions' field under VMX, not DR6. Mask off bit 12 when applying the
payload to DR6, as it is reserved on DR6 but not the 'pending debug
exceptions' field.

Fixes: f10c729ff965 ("kvm: vmx: Defer setting of DR6 until #DB delivery")
Signed-off-by: Oliver Upton <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

drm/i915/gem: Tighten checks and acquiring the mmap object

Make sure we hold the rcu lock as we acquire the rcu protected reference
of the object when looking it up from the associated mmap vma.

Closes: https://gitlab.freedesktop.org/drm/intel/issues/1083
Fixes: cc662126b413 ("drm/i915: Introduce DRM_I915_GEM_MMAP_OFFSET")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Abdiel Janulgue <[email protected]>
Cc: Matthew Auld <[email protected]>
Reviewed-by: Matthew Auld <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 280d14a69da2e71f43408537c008f2775d5e5360)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Fix preallocated barrier list append

Only the first and the last nodes were being added to
ref->preallocated_barriers.

Renaming variables to make it more easy to read.

Fixes: 841350223816 ("drm/i915/gt: Drop mutex serialisation between context pin/unpin")
Cc: Chris Wilson <[email protected]>
Cc: Maarten Lankhorst <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
Reviewed-by: Chris Wilson <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit d4c3c0b8221a72107eaf35c80c40716b81ca463e)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915/gt: Acquire ce->active before ce->pin_count/ce->pin_mutex

Similar to commit ac0e331a628b ("drm/i915: Tighten atomicity of
i915_active_acquire vs i915_active_release") we have the same race of
trying to pin the context underneath a mutex while allowing the
decrement to be atomic outside of that mutex. This leads to the problem
where two threads may simultaneously try to pin the context and the
second not notice that they needed to repin the context.

<2> [198.669621] kernel BUG at drivers/gpu/drm/i915/gt/intel_timeline.c:387!
<4> [198.669703] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [198.669712] CPU: 0 PID: 1246 Comm: gem_exec_create Tainted: G     U  W         5.5.0-rc6-CI-CI_DRM_7755+ #1
<4> [198.669723] Hardware name:  /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017
<4> [198.669776] RIP: 0010:timeline_advance+0x7b/0xe0 [i915]
<4> [198.669785] Code: 00 48 c7 c2 10 f1 46 a0 48 c7 c7 70 1b 32 a0 e8 bb dd e7 e0 bf 01 00 00 00 e8 d1 af e7 e0 31 f6 bf 09 00 00 00 e8 35 ef d8 e0 <0f> 0b 48 c7 c1 48 fa 49 a0 ba 84 01 00 00 48 c7 c6 10 f1 46 a0 48
<4> [198.669803] RSP: 0018:ffffc900004c3a38 EFLAGS: 00010296
<4> [198.669810] RAX: ffff888270b35140 RBX: ffff88826f32ee00 RCX: 0000000000000006
<4> [198.669818] RDX: 00000000000017c5 RSI: 0000000000000000 RDI: 0000000000000009
<4> [198.669826] RBP: ffffc900004c3a64 R08: 0000000000000000 R09: 0000000000000000
<4> [198.669834] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88826f9b5980
<4> [198.669841] R13: 0000000000000cc0 R14: ffffc900004c3dc0 R15: ffff888253610068
<4> [198.669849] FS:  00007f63e663fe40(0000) GS:ffff888276c00000(0000) knlGS:0000000000000000
<4> [198.669857] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [198.669864] CR2: 00007f171f8e39a8 CR3: 000000026b1f6005 CR4: 00000000003606f0
<4> [198.669872] Call Trace:
<4> [198.669924]  intel_timeline_get_seqno+0x12/0x40 [i915]
<4> [198.669977]  __i915_request_create+0x76/0x5a0 [i915]
<4> [198.670024]  i915_request_create+0x86/0x1c0 [i915]
<4> [198.670068]  i915_gem_do_execbuffer+0xbf2/0x2500 [i915]
<4> [198.670082]  ? __lock_acquire+0x460/0x15d0
<4> [198.670128]  i915_gem_execbuffer2_ioctl+0x11f/0x470 [i915]
<4> [198.670171]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [198.670181]  drm_ioctl_kernel+0xa7/0xf0
<4> [198.670188]  drm_ioctl+0x2e1/0x390
<4> [198.670233]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]

Fixes: 841350223816 ("drm/i915/gt: Drop mutex serialisation between context pin/unpin")
References: ac0e331a628b ("drm/i915: Tighten atomicity of i915_active_acquire vs i915_active_release")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Tvrtko Ursulin <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit e5429340bfa2dc43a07c3329e0c30cdae4cc0b35)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Tighten atomicity of i915_active_acquire vs i915_active_release

As we use a mutex to serialise the first acquire (as it may be a lengthy
operation), but only an atomic decrement for the release, we have to
be careful in case a second thread races and completes both
acquire/release as the first finishes its acquire.

Thread A Thread B
i915_active_acquire i915_active_acquire
  atomic_read() == 0   atomic_read() == 0
  mutex_lock()   mutex_lock()
  atomic_read() == 0
    ref->active();
  atomic_inc()
  mutex_unlock()
  atomic_read() == 1
i915_active_release
  atomic_dec_and_test() -> 0
    ref->retire()
  atomic_inc() -> 1
  mutex_unlock()

So thread A has acquired the ref->active_count but since the ref was
still active at the time, it did not initialise it. By switching the
check inside the mutex to an atomic increment only if already active, we
close the race.

Fixes: c9ad602feabe ("drm/i915: Split i915_active.mutex into an irq-safe spinlock for the rbtree")
Signed-off-by: Chris Wilson <[email protected]>
Reviewed-by: Tvrtko Ursulin <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit ac0e331a628b5ded087eab09fad2ffb082ac61ba)
Signed-off-by: Jani Nikula <[email protected]>

drm/i915: Stub out i915_gpu_coredump_put

i915_gpu_coreddump_put is currently only defined if
CONFIG_DRM_I915_CAPTURE_ERROR is enabled, provide a stub otherwise.

Reported-by: Mike Lothian <[email protected]>
Fixes: 742379c0c400 ("drm/i915: Start chopping up the GPU error capture")
Signed-off-by: Chris Wilson <[email protected]>
Cc: Mike Lothian <[email protected]>
Cc: Andi Shyti <[email protected]>
Reviewed-by: Andi Shyti <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 7e36505d0cf82f2920f2fd22ebb14a8b540396a3)
Signed-off-by: Jani Nikula <[email protected]>