Git Repo - linux.git/log

io-wq: fix hang after cancelling pending hashed work

Don't forget to update wqe->hash_tail after cancelling a pending work
item, if it was hashed.

Cc: [email protected] # 5.7+
Reported-by: Dmitry Shulyak <[email protected]>
Fixes: 86f3cd1b589a1 ("io-wq: handle hashed writes in chains")
Signed-off-by: Pavel Begunkov <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

io_uring: don't recurse on tsk->sighand->siglock with signalfd

If an application is doing reads on signalfd, and we arm the poll handler
because there's no data available, then the wakeup can recurse on the
tasks sighand->siglock as the signal delivery from task_work_add() will
use TWA_SIGNAL and that attempts to lock it again.

We can detect the signalfd case pretty easily by comparing the poll->head
wait_queue_head_t with the target task signalfd wait queue. Just use
normal task wakeup for this case.

Cc: [email protected] # v5.7+
Signed-off-by: Jens Axboe <[email protected]>

xhci: Always restore EP_SOFT_CLEAR_TOGGLE even if ep reset failed

Some device drivers call libusb_clear_halt when target ep queue
is not empty. (eg. spice client connected to qemu for usb redir)

Before commit f5249461b504 ("xhci: Clear the host side toggle
manually when endpoint is soft reset"), that works well.
But now, we got the error log:

EP not empty, refuse reset

xhci_endpoint_reset failed and left ep_state's EP_SOFT_CLEAR_TOGGLE
bit still set

So all the subsequent urb sumbits to the ep will fail with the
warn log:

Can't enqueue URB while manually clearing toggle

We need to clear ep_state EP_SOFT_CLEAR_TOGGLE bit after
xhci_endpoint_reset, even if it failed.

Fixes: f5249461b504 ("xhci: Clear the host side toggle manually when endpoint is soft reset")
Cc: stable <[email protected]> # v4.17+
Signed-off-by: Ding Hui <[email protected]>
Signed-off-by: Mathias Nyman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

xhci: Do warm-reset when both CAS and XDEV_RESUME are set

Sometimes re-plugging a USB device during system sleep renders the device
useless:
[  173.418345] xhci_hcd 0000:00:14.0: Get port status 2-4 read: 0x14203e2, return 0x10262
...
[  176.496485] usb 2-4: Waited 2000ms for CONNECT
[  176.496781] usb usb2-port4: status 0000.0262 after resume, -19
[  176.497103] usb 2-4: can't resume, status -19
[  176.497438] usb usb2-port4: logical disconnect

Because PLS equals to XDEV_RESUME, xHCI driver reports U3 to usbcore,
despite of CAS bit is flagged.

So proritize CAS over XDEV_RESUME to let usbcore handle warm-reset for
the port.

Cc: stable <[email protected]>
Signed-off-by: Kai-Heng Feng <[email protected]>
Signed-off-by: Mathias Nyman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

usb: host: xhci: fix ep context print mismatch in debugfs

dci is 0 based and xhci_get_ep_ctx() will do ep index increment to get
the ep context.

[rename dci to ep_index -Mathias]
Cc: stable <[email protected]> # v4.15+
Fixes: 02b6fdc2a153 ("usb: xhci: Add debugfs interface for xHCI driver")
Signed-off-by: Li Jun <[email protected]>
Signed-off-by: Mathias Nyman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Merge tag 'misc-habanalabs-fixes-2020-08-22' of git://people.freedesktop.org/~gabbayo/linux into char-misc-linus

Oded writes:

This tag contains the following bug fixes for 5.9-rc2/3:

- Correct cleanup of PCI bar mapping in case of failure during
  initialization.

- Several security fixes:
  - Validating user addresses before mapping them
  - Validating packet id (from user) before using it as index for array.
  - Validating F/W file size before coping it.
  - Prevent possible overflow when validating address from user in
    profiler.
  - Validate queue index (from user) before using it as index for array.
  - Check for correct vmalloc return code

- Fix memory corruption in debugfs entry

- Fix a loop in gaudi_extract_ecc_info()

- Fix the set clock gating function in gaudi code

- Set maximum power to F/W according to the card type

- Cix incorrect check on failed workqueue create

- Correctly report error when configuring the PCI controller

* tag 'misc-habanalabs-fixes-2020-08-22' of git://people.freedesktop.org/~gabbayo/linux:
  habanalabs: correctly report inbound pci region cfg error
  habanalabs: check correct vmalloc return code
  habanalabs: validate FW file size
  habanalabs: fix incorrect check on failed workqueue create
  habanalabs: set max power according to card type
  habanalabs: proper handling of alloc size in coresight
  habanalabs: set clock gating according to mask
  habanalabs: verify user input in cs_ioctl_signal_wait
  habanalabs: Fix a loop in gaudi_extract_ecc_info()
  habanalabs: Fix memory corruption in debugfs
  habanalabs: validate packet id during CB parse
  habanalabs: Validate user address before mapping
  habanalabs: unmap PCI bars upon iATU failure

Merge branch 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull epoll fixes from Al Viro:
"Fix reference counting and clean up exit paths"

* 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
do_epoll_ctl(): clean the failure exits up a bit
epoll: Keep a reference on files added to the check list

do_epoll_ctl(): clean the failure exits up a bit

Signed-off-by: Al Viro <[email protected]>

epoll: Keep a reference on files added to the check list

When adding a new fd to an epoll, and that this new fd is an
epoll fd itself, we recursively scan the fds attached to it
to detect cycles, and add non-epool files to a "check list"
that gets subsequently parsed.

However, this check list isn't completely safe when deletions
can happen concurrently. To sidestep the issue, make sure that
a struct file placed on the check list sees its f_count increased,
ensuring that a concurrent deletion won't result in the file
disapearing from under our feet.

Cc: [email protected]
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Al Viro <[email protected]>

net: nexthop: don't allow empty NHA_GROUP

Currently the nexthop code will use an empty NHA_GROUP attribute, but it
requires at least 1 entry in order to function properly. Otherwise we
end up derefencing null or random pointers all over the place due to not
having any nh_grp_entry members allocated, nexthop code relies on having at
least the first member present. Empty NHA_GROUP doesn't make any sense so
just disallow it.
Also add a WARN_ON for any future users of nexthop_create_group().

BUG: kernel NULL pointer dereference, address: 0000000000000080
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP
CPU: 0 PID: 558 Comm: ip Not tainted 5.9.0-rc1+ #93
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:fib_check_nexthop+0x4a/0xaa
Code: 0f 84 83 00 00 00 48 c7 02 80 03 f7 81 c3 40 80 fe fe 75 12 b8 ea ff ff ff 48 85 d2 74 6b 48 c7 02 40 03 f7 81 c3 48 8b 40 10 <48> 8b 80 80 00 00 00 eb 36 80 78 1a 00 74 12 b8 ea ff ff ff 48 85
RSP: 0018:ffff88807983ba00 EFLAGS: 00010213
RAX: 0000000000000000 RBX: ffff88807983bc00 RCX: 0000000000000000
RDX: ffff88807983bc00 RSI: 0000000000000000 RDI: ffff88807bdd0a80
RBP: ffff88807983baf8 R08: 0000000000000dc0 R09: 000000000000040a
R10: 0000000000000000 R11: ffff88807bdd0ae8 R12: 0000000000000000
R13: 0000000000000000 R14: ffff88807bea3100 R15: 0000000000000001
FS:  00007f10db393700(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000080 CR3: 000000007bd0f004 CR4: 00000000003706f0
Call Trace:
  fib_create_info+0x64d/0xaf7
  fib_table_insert+0xf6/0x581
  ? __vma_adjust+0x3b6/0x4d4
  inet_rtm_newroute+0x56/0x70
  rtnetlink_rcv_msg+0x1e3/0x20d
  ? rtnl_calcit.isra.0+0xb8/0xb8
  netlink_rcv_skb+0x5b/0xac
  netlink_unicast+0xfa/0x17b
  netlink_sendmsg+0x334/0x353
  sock_sendmsg_nosec+0xf/0x3f
  ____sys_sendmsg+0x1a0/0x1fc
  ? copy_msghdr_from_user+0x4c/0x61
  ___sys_sendmsg+0x63/0x84
  ? handle_mm_fault+0xa39/0x11b5
  ? sockfd_lookup_light+0x72/0x9a
  __sys_sendmsg+0x50/0x6e
  do_syscall_64+0x54/0xbe
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f10dacc0bb7
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 8b 05 9a 4b 2b 00 85 c0 75 2e 48 63 ff 48 63 d2 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 f2 2a 00 f7 d8 64 89 02 48
RSP: 002b:00007ffcbe628bf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007ffcbe628f80 RCX: 00007f10dacc0bb7
RDX: 0000000000000000 RSI: 00007ffcbe628c60 RDI: 0000000000000003
RBP: 000000005f41099c R08: 0000000000000001 R09: 0000000000000008
R10: 00000000000005e9 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007ffcbe628d70 R15: 0000563a86c6e440
Modules linked in:
CR2: 0000000000000080

CC: David Ahern <[email protected]>
Fixes: 430a049190de ("nexthop: Add support for nexthop groups")
Reported-by: [email protected]
Signed-off-by: Nikolay Aleksandrov <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

drm/msm/a6xx: fix frequency not always being restored on GMU resume

The patch reorganizing the set_freq function made it so the gmu resume
doesn't always set the frequency, because a6xx_gmu_set_freq() exits early
when the frequency hasn't been changed. Note this always happens when
resuming GMU after recovering from a hang.

Use a simple workaround to prevent this from happening.

Fixes: 1f60d11423db ("drm: msm: a6xx: send opp instead of a frequency")
Signed-off-by: Jonathan Marek <[email protected]>
Signed-off-by: Rob Clark <[email protected]>

drm/msm/a6xx: add module param to enable debugbus snapshot

For production devices, the debugbus sections will typically be fused
off and empty in the gpu device coredump. But since this may contain
data like cache contents, don't capture it by default.

Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Jordan Crouse <[email protected]>
Signed-off-by: Rob Clark <[email protected]>

drm/msm/a6xx: fix crashdec section name typo

Backport note: maybe wait some time for the crashdec MR[1] to look for
both the old typo'd name and the corrected name to land in mesa 20.2

[1] https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6242

Fixes: 1707add81551 ("drm/msm/a6xx: Add a6xx gpu state")
Signed-off-by: Rob Clark <[email protected]>
Reviewed-by: Jordan Crouse <[email protected]>
Signed-off-by: Rob Clark <[email protected]>

drm/msm/a6xx: fix gmu start on newer firmware

New Qualcomm firmware has changed a way it reports back the 'started'
event. Support new register values.

Signed-off-by: Dmitry Baryshkov <[email protected]>
Signed-off-by: Rob Clark <[email protected]>

Merge tag 'kbuild-fixes-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

- move -Wsign-compare warning from W=2 to W=3

- fix the keyword _restrict to __restrict in genksyms

- fix more bugs in qconf

* tag 'kbuild-fixes-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: qconf: replace deprecated QString::sprintf() with QTextStream
  kconfig: qconf: remove redundant help in the info view
  kconfig: qconf: remove qInfo() to get back Qt4 support
  kconfig: qconf: remove unused colNr
  kconfig: qconf: fix the popup menu in the ConfigInfoView window
  kconfig: qconf: fix signal connection to invalid slots
  genksyms: keywords: Use __restrict not _restrict
  kbuild: remove redundant patterns in filter/filter-out
  extract-cert: add static to local data
  Makefile.extrawarn: Move sign-compare from W=2 to W=3

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

- Allow booting of late secondary CPUs affected by erratum 1418040
   (currently they are parked if none of the early CPUs are affected by
   this erratum).

- Add the 32-bit vdso Makefile to the vdso_install rule so that 'make
   vdso_install' installs the 32-bit compat vdso when it is compiled.

- Print a warning that untrusted guests without a CPU erratum
   workaround (Cortex-A57 832075) may deadlock the affected system.

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  ARM64: vdso32: Install vdso32 from vdso_install
  KVM: arm64: Print warning when cpu erratum can cause guests to deadlock
  arm64: Allow booting of late CPUs affected by erratum 1418040
  arm64: Move handling of erratum 1418040 into C code

Merge tag 's390-5.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fixes from Vasily Gorbik:

- a couple of fixes for storage key handling relevant for debugging

- add cond_resched into potentially slow subchannels scanning loop

- fixes for PF/VF linking and to ignore stale PCI configuration request
   events

* tag 's390-5.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/pci: fix PF/VF linking on hot plug
  s390/pci: re-introduce zpci_remove_device()
  s390/pci: fix zpci_bus_link_virtfn()
  s390/ptrace: fix storage key handling
  s390/runtime_instrumentation: fix storage key handling
  s390/pci: ignore stale configuration request event
  s390/cio: add cond_resched() in the slow_eval_known_fn() loop

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:

- PAE and PKU bugfixes for x86

- selftests fix for new binutils

- MMU notifier fix for arm64

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set
  KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()
  kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode
  kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode
  KVM: x86: fix access code passed to gva_to_gpa
  selftests: kvm: Use a shorter encoding to clear RAX

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"23 fixes in 5 drivers (qla2xxx, ufs, scsi_debug, fcoe, zfcp). The bulk
  of the changes are in qla2xxx and ufs and all are mostly small and
  definitely don't impact the core"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (23 commits)
  Revert "scsi: qla2xxx: Disable T10-DIF feature with FC-NVMe during probe"
  Revert "scsi: qla2xxx: Fix crash on qla2x00_mailbox_command"
  scsi: qla2xxx: Fix null pointer access during disconnect from subsystem
  scsi: qla2xxx: Check if FW supports MQ before enabling
  scsi: qla2xxx: Fix WARN_ON in qla_nvme_register_hba
  scsi: qla2xxx: Allow ql2xextended_error_logging special value 1 to be set anytime
  scsi: qla2xxx: Reduce noisy debug message
  scsi: qla2xxx: Fix login timeout
  scsi: qla2xxx: Indicate correct supported speeds for Mezz card
  scsi: qla2xxx: Flush I/O on zone disable
  scsi: qla2xxx: Flush all sessions on zone disable
  scsi: qla2xxx: Use MBX_TOV_SECONDS for mailbox command timeout values
  scsi: scsi_debug: Fix scp is NULL errors
  scsi: zfcp: Fix use-after-free in request timeout handlers
  scsi: ufs: No need to send Abort Task if the task in DB was cleared
  scsi: ufs: Clean up completed request without interrupt notification
  scsi: ufs: Improve interrupt handling for shared interrupts
  scsi: ufs: Fix interrupt error message for shared interrupts
  scsi: ufs-pci: Add quirk for broken auto-hibernate for Intel EHL
  scsi: ufs-mediatek: Fix incorrect time to wait link status
  ...

Merge tag 'devicetree-fixes-for-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:
"Another set of DT fixes:

   - restore range parsing error check

   - workaround PCI range parsing with missing 'device_type' now
     required

   - correct description of 'phy-connection-type'

   - fix erroneous matching on 'snps,dw-pcie' by 'intel,lgm-pcie' schema

   - a couple of grammar and whitespace fixes

   - update Shawn Guo's email"

* tag 'devicetree-fixes-for-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: vendor-prefixes: Remove trailing whitespace
  dt-bindings: net: correct description of phy-connection-type
  dt-bindings: PCI: intel,lgm-pcie: Fix matching on all snps,dw-pcie instances
  of: address: Work around missing device_type property in pcie nodes
  dt: writing-schema: Miscellaneous grammar fixes
  dt-bindings: Use Shawn Guo's preferred e-mail for i.MX bindings
  of/address: check for invalid range.cpu_addr

habanalabs: correctly report inbound pci region cfg error

During inbound iATU configuration we can get errors while
configuring PCI registers, there is a certain scenario in which these
errors are not reflected and driver is loaded with wrong configuration.

Signed-off-by: Ofir Bitton <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: check correct vmalloc return code

vmalloc can return different return code than NULL and a valid
pointer. We must validate it in order to dereference a non valid
pointer.

Signed-off-by: Ofir Bitton <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: validate FW file size

We must validate FW size in order not to corrupt memory in case
a malicious FW file will be present in system.

Signed-off-by: Ofir Bitton <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: fix incorrect check on failed workqueue create

The null check on a failed workqueue create is currently null checking
hdev->cq_wq rather than the pointer hdev->cq_wq[i] and so the test
will never be true on a failed workqueue create. Fix this by checking
hdev->cq_wq[i].

Addresses-Coverity: ("Dereference before null check")
Fixes: 5574cb2194b1 ("habanalabs: Assign each CQ with its own work queue")
Signed-off-by: Colin Ian King <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: set max power according to card type

In Gaudi, the default max power setting is different between PCI and PMC
cards. Therefore, the driver need to set the default after knowing what is
the card type.

The current code has a bug where it limits the maximum power of the PMC
card to 200W after a reset occurs.

Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: proper handling of alloc size in coresight

Allocation size can go up to 64bit but truncated to 32bit,
we should make sure it is not truncated and validate no address
overflow.

Signed-off-by: Ofir Bitton <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: set clock gating according to mask

Once clock gating is set we enable clock gating according to mask,
we should also disable clock gating according to relevant bits.

Signed-off-by: Ofir Bitton <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: verify user input in cs_ioctl_signal_wait

User input must be validated before using it to
access internal structures.

Signed-off-by: Ofir Bitton <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: Fix a loop in gaudi_extract_ecc_info()

The condition was reversed. It should have been less than instead of
greater than. The result is that we never enter the loop.

Fixes: fcc6a4e60678 ("habanalabs: Extract ECC information from FW")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: Fix memory corruption in debugfs

This has to be a long instead of a u32 because we write a long value.
On 64 bit systems, this will cause memory corruption.

Fixes: c216477363a3 ("habanalabs: add debugfs support")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: validate packet id during CB parse

During command buffer parsing, driver extracts packet id
from user buffer. Driver must validate this packet id, since it is
being used in order to extract information from internal structures.

Signed-off-by: Ofir Bitton <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: Validate user address before mapping

User address must be validated before driver performs address map.

Signed-off-by: Ofir Bitton <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

habanalabs: unmap PCI bars upon iATU failure

In case the driver fails to configure the PCI controller iATU, it needs to
unmap the PCI bars before exiting so if the driver is removed, the bars
won't be left mapped.

Signed-off-by: Ofir Bitton <[email protected]>
Reviewed-by: Oded Gabbay <[email protected]>
Signed-off-by: Oded Gabbay <[email protected]>

drm/msm: enable vblank during atomic commits

This has roughly the same effect as drm_atomic_helper_wait_for_vblanks(),
basically just ensuring that vblank accounting is enabled so that we get
valid timestamp/seqn on pageflip events.

Signed-off-by: Rob Clark <[email protected]>
Tested-by: Stephen Boyd <[email protected]>
Signed-off-by: Rob Clark <[email protected]>

null_blk: fix passing of REQ_FUA flag in null_handle_rq

REQ_FUA should be checked using rq->cmd_flags instead of req_op().

Fixes: deb78b419dfda ("nullb: emulate cache")
Signed-off-by: Hou Pu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet: Disable keep-alive timer when kato is cleared to 0h

Based on nvme spec, when keep alive timeout is set to zero
the keep-alive timer should be disabled.

Signed-off-by: Amit Engel <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: redirect commands on dying queue

If a command send through nvme-multipath failed on a dying queue, resend it
on another path.

Signed-off-by: Chao Leng <[email protected]>
[hch: rebased on top of the completion refactoring]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: just check the status code type in nvme_is_path_error

Check the SCT sub-field for a path related status instead of enumerating
invididual status code. As of NVMe 1.4 this adds "Internal Path Error"
and "Controller Pathing Error" to the list, but it also future proofs for
additional status codes added to the category.

Suggested-by: Chao Leng <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: refactor command completion

Lift all the code to decide the dispostition of a completed command
from nvme_complete_rq and nvme_failover_req into a new helper, which
returns an emum of the potential actions. nvme_complete_rq then
just switches on those and calls the proper helper for the action.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: rename and document nvme_end_request

nvme_end_request is a bit misnamed, as it wraps around the
blk_mq_complete_* API. It's semantics also are non-trivial, so give it
a more descriptive name and add a comment explaining the semantics.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Reviewed-by: Mike Snitzer <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: skip noiob for zoned devices

Zoned block devices reuse the chunk_sectors queue limit to define zone
boundaries. If a such a device happens to also report an optimal
boundary, do not use that to define the chunk_sectors as that may
intermittently interfere with io splitting and zone size queries.

Signed-off-by: Keith Busch <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-pci: fix PRP pool size

All operations are based on the controller, not the host page size.
Switch the dma pool to use the controller page size as well to avoid
massive overallocations on large page size systems.

Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth

Recently nvme_dev.q_depth was changed from an int to u16 type.

This falls over for the queue depth calculation in nvme_pci_enable(),
where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow as a u16, as
NVME_CAP_MQES() is a 16b number also. That happens for me, and this is the
result:

root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
dereference at virtual address 0000000000000010
Mem abort info:
ESR = 0x96000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000004
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=00000a27bf3c9000
[0000000000000010] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: nvme nvme_core
CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
5.8.0-next-20200812 #27
Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
V1.16.01 03/15/2019
Workqueue: nvme-reset-wq nvme_reset_work [nvme]
pstate: 80c00009 (Nzcv daif +PAN +UAO BTYPE=--)
pc : __sg_alloc_table_from_pages+0xec/0x238
lr : __sg_alloc_table_from_pages+0xc8/0x238
sp : ffff800013ccbad0
x29: ffff800013ccbad0 x28: ffff0a27b3d380a8
x27: 0000000000000000 x26: 0000000000002dc2
x25: 0000000000000dc0 x24: 0000000000000000
x23: 0000000000000000 x22: ffff800013ccbbe8
x21: 0000000000000010 x20: 0000000000000000
x19: 00000000fffff000 x18: ffffffffffffffff
x17: 00000000000000c0 x16: fffffe289eaf6380
x15: ffff800011b59948 x14: ffff002bc8fe98f8
x13: ff00000000000000 x12: ffff8000114ca000
x11: 0000000000000000 x10: ffffffffffffffff
x9 : ffffffffffffffc0 x8 : ffff0a27b5f9b6a0
x7 : 0000000000000000 x6 : 0000000000000001
x5 : ffff0a27b5f9b680 x4 : 0000000000000000
x3 : ffff0a27b5f9b680 x2 : 0000000000000000
x1 : 0000000000000001 x0 : 0000000000000000
Call trace:
__sg_alloc_table_from_pages+0xec/0x238
sg_alloc_table_from_pages+0x18/0x28
iommu_dma_alloc+0x474/0x678
dma_alloc_attrs+0xd8/0xf0
nvme_alloc_queue+0x114/0x160 [nvme]
nvme_reset_work+0xb34/0x14b4 [nvme]
process_one_work+0x1e8/0x360
worker_thread+0x44/0x478
kthread+0x150/0x158
ret_from_fork+0x10/0x34
Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
---[ end trace 89bb2b72d59bf925 ]---

Fix by making onto a u32.

Also use u32 for nvme_dev.q_depth, as we assign this value from
nvme_dev.q_depth, and nvme_dev.q_depth will possibly hold 65536 - this
avoids the same crash as above.

Fixes: 61f3b8963097 ("nvme-pci: use unsigned for io queue depth")
Signed-off-by: John Garry <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: Use spin_lock_irq() when taking the ctrl->lock

When locking the ctrl->lock spinlock IRQs need to be disabled to avoid a
dead lock. The new spin_lock() calls recently added produce the
following lockdep warning when running the blktest nvme/003:

    ================================
    WARNING: inconsistent lock state
    --------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    ksoftirqd/2/22 [HC0[0]:SC1[1]:HE0:SE0] takes:
    ffff888276a8c4c0 (&ctrl->lock){+.?.}-{2:2}, at: nvme_keep_alive_end_io+0x50/0xc0
    {SOFTIRQ-ON-W} state was registered at:
      lock_acquire+0x164/0x500
      _raw_spin_lock+0x28/0x40
      nvme_get_effects_log+0x37/0x1c0
      nvme_init_identify+0x9e4/0x14f0
      nvme_reset_work+0xadd/0x2360
      process_one_work+0x66b/0xb70
      worker_thread+0x6e/0x6c0
      kthread+0x1e7/0x210
      ret_from_fork+0x22/0x30
    irq event stamp: 1449221
    hardirqs last  enabled at (1449220): [<ffffffff81c58e69>] ktime_get+0xf9/0x140
    hardirqs last disabled at (1449221): [<ffffffff83129665>] _raw_spin_lock_irqsave+0x25/0x60
    softirqs last  enabled at (1449210): [<ffffffff83400447>] __do_softirq+0x447/0x595
    softirqs last disabled at (1449215): [<ffffffff81b489b5>] run_ksoftirqd+0x35/0x50

    other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&ctrl->lock);
      <Interrupt>
        lock(&ctrl->lock);

     *** DEADLOCK ***

    no locks held by ksoftirqd/2/22.

    stack backtrace:
    CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 5.8.0-rc4-eid-vmlocalyes-dbg-00157-g7236657c6b3a #1450
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014
    Call Trace:
     dump_stack+0xc8/0x11a
     print_usage_bug.cold.63+0x235/0x23e
     mark_lock+0xa9c/0xcf0
     __lock_acquire+0xd9a/0x2b50
     lock_acquire+0x164/0x500
     _raw_spin_lock_irqsave+0x40/0x60
     nvme_keep_alive_end_io+0x50/0xc0
     blk_mq_end_request+0x158/0x210
     nvme_complete_rq+0x146/0x500
     nvme_loop_complete_rq+0x26/0x30 [nvme_loop]
     blk_done_softirq+0x187/0x1e0
     __do_softirq+0x118/0x595
     run_ksoftirqd+0x35/0x50
     smpboot_thread_fn+0x1d3/0x310
     kthread+0x1e7/0x210
     ret_from_fork+0x22/0x30

Fixes: be93e87e7802 ("nvme: support for multiple Command Sets Supported and Effects log pages")
Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Tested-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet: call blk_mq_free_request() directly

Instead of calling blk_put_request() which calls blk_mq_free_request(),
call blk_mq_free_request() directly for NVMeOF passthru. This is to
mainly avoid an extra function call in the completion path
nvmet_passthru_req_done().

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet: fix oops in pt cmd execution

In the existing NVMeOF Passthru core command handling on failure of
nvme_alloc_request() it errors out with rq value set to NULL. In the
error handling path it calls blk_put_request() without checking if
rq is set to NULL or not which produces following Oops:-

[ 1457.346861] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 1457.347838] #PF: supervisor read access in kernel mode
[ 1457.348464] #PF: error_code(0x0000) - not-present page
[ 1457.349085] PGD 0 P4D 0
[ 1457.349402] Oops: 0000 [#1] SMP NOPTI
[ 1457.349851] CPU: 18 PID: 10782 Comm: kworker/18:2 Tainted: G           OE     5.8.0-rc4nvme-5.9+ #35
[ 1457.350951] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e3214
[ 1457.352347] Workqueue: events nvme_loop_execute_work [nvme_loop]
[ 1457.353062] RIP: 0010:blk_mq_free_request+0xe/0x110
[ 1457.353651] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
[ 1457.355975] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
[ 1457.356636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
[ 1457.357526] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
[ 1457.358416] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
[ 1457.359317] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
[ 1457.360424] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
[ 1457.361322] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
[ 1457.362337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1457.363058] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
[ 1457.363973] Call Trace:
[ 1457.364296]  nvmet_passthru_execute_cmd+0x150/0x2c0 [nvmet]
[ 1457.364990]  process_one_work+0x24e/0x5a0
[ 1457.365493]  ? __schedule+0x353/0x840
[ 1457.365957]  worker_thread+0x3c/0x380
[ 1457.366426]  ? process_one_work+0x5a0/0x5a0
[ 1457.366948]  kthread+0x135/0x150
[ 1457.367362]  ? kthread_create_on_node+0x60/0x60
[ 1457.367934]  ret_from_fork+0x22/0x30
[ 1457.368388] Modules linked in: nvme_loop(OE) nvmet(OE) nvme_fabrics(OE) null_blk nvme(OE) nvme_corer
[ 1457.368414]  ata_piix crc32c_intel virtio_pci libata virtio_ring serio_raw t10_pi virtio floppy dm_]
[ 1457.380849] CR2: 0000000000000000
[ 1457.381288] ---[ end trace c6cab61bfd1f68fd ]---
[ 1457.381861] RIP: 0010:blk_mq_free_request+0xe/0x110
[ 1457.382469] Code: 3f ff ff ff 83 f8 01 75 0d 4c 89 e7 e8 1b db ff ff e9 2d ff ff ff 0f 0b eb ef 66 8
[ 1457.384749] RSP: 0018:ffffc900035b7de0 EFLAGS: 00010282
[ 1457.385393] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000002
[ 1457.386264] RDX: ffffffffa060bd05 RSI: 0000000000000000 RDI: 0000000000000000
[ 1457.387142] RBP: 0000000000000037 R08: 0000000000000000 R09: 0000000000000000
[ 1457.388029] R10: 0000000000000000 R11: 000000000000006d R12: 0000000000000000
[ 1457.388914] R13: ffff8887ffa68600 R14: 0000000000000000 R15: ffff8888150564c8
[ 1457.389798] FS:  0000000000000000(0000) GS:ffff888814600000(0000) knlGS:0000000000000000
[ 1457.390796] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1457.391508] CR2: 0000000000000000 CR3: 000000081c0ac000 CR4: 00000000003406e0
[ 1457.392525] Kernel panic - not syncing: Fatal exception
[ 1457.394138] Kernel Offset: disabled
[ 1457.394677] ---[ end Kernel panic - not syncing: Fatal exception ]---

We fix this Oops by adding a new goto label out_put_req and reordering
the blk_put_request call to avoid calling blk_put_request() with rq
value is set to NULL. Here we also update the rest of the code
accordingly.

Fixes: 06b7164dfdc0 ("nvmet: add passthru code to process commands")
Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet: add ns tear down label for pt-cmd handling

In the current implementation before submitting the passthru cmd we
may come across error e.g. getting ns from passthru controller,
allocating a request from passthru controller, etc. For all the failure
cases it only uses single goto label fail_out.

In the target code, we follow the pattern to have a separate label for
each error out the case when setting up multiple things before the actual
action.

This patch follows the same pattern and renames generic fail_out label
to out_put_ns and updates the error out cases in the
nvmet_passthru_execute_cmd() where it is needed.

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: multipath: round-robin: eliminate "fallback" variable

If we find an optimized path, we quit the loop immediately. Thus we can use
just one variable for the next path, slighly simplifying the code.

Signed-off-by: Martin Wilck <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme: multipath: round-robin: fix single non-optimized path case

If there's only one usable, non-optimized path, nvme_round_robin_path()
returns NULL, which is wrong. Fix it by falling back to "old", like in
the single optimized path case. Also, if the active path isn't changed,
there's no need to re-assign the pointer.

Fixes: 3f6e3246db0e ("nvme-multipath: fix logic for non-optimized paths")
Signed-off-by: Martin Wilck <[email protected]>
Signed-off-by: Martin George <[email protected]>
Reviewed-by: Keith Busch <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvme-fc: Fix wrong return value in __nvme_fc_init_request()

On an error exit path, a negative error code should be returned
instead of a positive return value.

Fixes: e399441de9115 ("nvme-fabrics: Add host support for FC transport")
Cc: James Smart <[email protected]>
Signed-off-by: Tianjia Zhang <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet-passthru: Reject commands with non-sgl flags set

Any command with a non-SGL flag set (like fuse flags) should be
rejected.

Fixes: c1fef73f793b ("nvmet: add passthru code to process commands")
Signed-off-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

nvmet: fix a memory leak

We forgot to free new_model_number

Fixes: 013b7ebe5a0d ("nvmet: make ctrl model configurable")
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Sagi Grimberg <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

blkcg: fix memleak for iolatency

Normally, blkcg_iolatency_exit() will free related memory in iolatency
when cleanup queue. But if blk_throtl_init() return error and queue init
fail, blkcg_iolatency_exit() will not do that for us. Then it cause
memory leak.

Fixes: d70675121546 ("block: introduce blk-iolatency io controller")
Signed-off-by: Yufen Yu <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

MAINTAINERS: Add missing header files to BLOCK LAYER section

The various <linux/blk*.h> header files are part of the Block Layer.
Add them to the corresponding section in the MAINTAINERS file, so
scripts/get_maintainer.pl will pick them up.

Signed-off-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block: fix get_max_io_size()

A previous commit aligning splits to physical block sizes inadvertently
modified one return case such that that it now returns 0 length splits
when the number of sectors doesn't exceed the physical offset. This
later hits a BUG in bio_split(). Restore the previous working behavior.

Fixes: 9cc5169cd478b ("block: Improve physical block alignment of split bios")
Reported-by: Eric Deal <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
Cc: Bart Van Assche <[email protected]>
Cc: [email protected]
Signed-off-by: Jens Axboe <[email protected]>

blk-mq: insert request not through ->queue_rq into sw/scheduler queue

c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list") supposed
to add request which has been through ->queue_rq() to the hw queue dispatch
list, however it adds request running out of budget or driver tag to hw queue
too. This way basically bypasses request merge, and causes too many request
dispatched to LLD, and system% is unnecessary increased.

Fixes this issue by adding request not through ->queue_rq into sw/scheduler
queue, and this way is safe because no ->queue_rq is called on this request
yet.

High %system can be observed on Azure storvsc device, and even soft lock
is observed. This patch reduces %system during heavy sequential IO,
meantime decreases soft lockup risk.

Fixes: c616cbee97ae ("blk-mq: punt failed direct issue to dispatch list")
Signed-off-by: Ming Lei <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Bart Van Assche <[email protected]>
Cc: Mike Snitzer <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

block/rnbd: Ensure err is always initialized in process_rdma

Clang warns:

drivers/block/rnbd/rnbd-srv.c:150:6: warning: variable 'err' is used
uninitialized whenever 'if' condition is true
[-Wsometimes-uninitialized]
        if (IS_ERR(bio)) {
            ^~~~~~~~~~~
drivers/block/rnbd/rnbd-srv.c:177:9: note: uninitialized use occurs here
        return err;
               ^~~
drivers/block/rnbd/rnbd-srv.c:150:2: note: remove the 'if' if its
condition is always false
        if (IS_ERR(bio)) {
        ^~~~~~~~~~~~~~~~~~
drivers/block/rnbd/rnbd-srv.c:126:9: note: initialize the variable 'err'
to silence this warning
        int err;
               ^
                = 0
1 warning generated.

err is indeed uninitialized when this statement is taken. Ensure that it
is assigned the error value of bio before jumping to the error handling
label.

Fixes: 735d77d4fd28 ("rnbd: remove rnbd_dev_submit_io")
Reported-by: Brooke Basile <[email protected]>
Signed-off-by: Nathan Chancellor <[email protected]>
Acked-by: Jack Wang <[email protected]>
Link: https://github.com/ClangBuiltLinux/linux/issues/1134
Signed-off-by: Jens Axboe <[email protected]>

dt-bindings: vendor-prefixes: Remove trailing whitespace

Fixes: f516fb704d02fff2 ("dt-bindings: Whitespace clean-ups in schema files")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>

KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set

When an MMU notifier call results in unmapping a range that spans multiple
PGDs, we end up calling into cond_resched_lock() when crossing a PGD boundary,
since this avoids running into RCU stalls during VM teardown. Unfortunately,
if the VM is destroyed as a result of OOM, then blocking is not permitted
and the call to the scheduler triggers the following BUG():

| BUG: sleeping function called from invalid context at arch/arm64/kvm/mmu.c:394
| in_atomic(): 1, irqs_disabled(): 0, non_block: 1, pid: 36, name: oom_reaper
| INFO: lockdep is turned off.
| CPU: 3 PID: 36 Comm: oom_reaper Not tainted 5.8.0 #1
| Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
| Call trace:
|  dump_backtrace+0x0/0x284
|  show_stack+0x1c/0x28
|  dump_stack+0xf0/0x1a4
|  ___might_sleep+0x2bc/0x2cc
|  unmap_stage2_range+0x160/0x1ac
|  kvm_unmap_hva_range+0x1a0/0x1c8
|  kvm_mmu_notifier_invalidate_range_start+0x8c/0xf8
|  __mmu_notifier_invalidate_range_start+0x218/0x31c
|  mmu_notifier_invalidate_range_start_nonblock+0x78/0xb0
|  __oom_reap_task_mm+0x128/0x268
|  oom_reap_task+0xac/0x298
|  oom_reaper+0x178/0x17c
|  kthread+0x1e4/0x1fc
|  ret_from_fork+0x10/0x30

Use the new 'flags' argument to kvm_unmap_hva_range() to ensure that we
only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is set in the notifier
flags.

Cc: <[email protected]>
Fixes: 8b3405e345b5 ("kvm: arm/arm64: Fix locking for kvm_free_stage2_pgd")
Cc: Marc Zyngier <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: James Morse <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Message-Id: <20200811102725 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: Pass MMU notifier range flags to kvm_unmap_hva_range()

The 'flags' field of 'struct mmu_notifier_range' is used to indicate
whether invalidate_range_{start,end}() are permitted to block. In the
case of kvm_mmu_notifier_invalidate_range_start(), this field is not
forwarded on to the architecture-specific implementation of
kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
whether or not to block.

Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
architectures are aware as to whether or not they are permitted to block.

Cc: <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: James Morse <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
Message-Id: <20200811102725 [email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>

dt-bindings: net: correct description of phy-connection-type

The phy-connection-type parameter is described in ePAPR 1.1:

Specifies interface type between the Ethernet device and a physical
layer (PHY) device. The value of this property is specific to the
implementation.

Signed-off-by: Madalin Bucur <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Herring <[email protected]>

Merge tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:

- Make sure the head link cancelation includes async work

- Get rid of kiocb_wait_page_queue_init(), makes no sense to have it as
   a separate function since you moved it into io_uring itself

- io_import_iovec cleanups (Pavel, me)

- Use system_unbound_wq for ring exit work, to avoid spawning tons of
   these if we have tons of rings exiting at the same time

- Fix req->flags overflow flag manipulation (Pavel)

* tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block:
  io_uring: kill extra iovec=NULL in import_iovec()
  io_uring: comment on kfree(iovec) checks
  io_uring: fix racy req->flags modification
  io_uring: use system_unbound_wq for ring exit work
  io_uring: cleanup io_import_iovec() of pre-mapped request
  io_uring: get rid of kiocb_wait_page_queue_init()
  io_uring: find and cancel head link async work on files exit

dt-bindings: PCI: intel,lgm-pcie: Fix matching on all snps,dw-pcie instances

The intel,lgm-pcie binding is matching on all snps,dw-pcie instances
which is wrong. Add a custom 'select' entry to fix this.

Fixes: e54ea45a4955 ("dt-bindings: PCI: intel: Add YAML schemas for the PCIe RC controller")
Cc: Bjorn Helgaas <[email protected]>
Cc: [email protected]
Reviewed-by: Dilip Kota <[email protected]>
Signed-off-by: Rob Herring <[email protected]>

Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
"11 patches.

  Subsystems affected by this: misc, mm/hugetlb, mm/vmalloc, mm/misc,
  romfs, relay, uprobes, squashfs, mm/cma, mm/pagealloc"

* emailed patches from Andrew Morton <[email protected]>:
  mm, page_alloc: fix core hung in free_pcppages_bulk()
  mm: include CMA pages in lowmem_reserve at boot
  squashfs: avoid bio_alloc() failure with 1Mbyte blocks
  uprobes: __replace_page() avoid BUG in munlock_vma_page()
  kernel/relay.c: fix memleak on destroy relay channel
  romfs: fix uninitialized memory leak in romfs_dev_read()
  mm/rodata_test.c: fix missing function declaration
  mm/vunmap: add cond_resched() in vunmap_pmd_range
  khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()
  hugetlb_cgroup: convert comma to semicolon
  mailmap: add Andi Kleen

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2020-08-21

The following pull-request contains BPF updates for your *net* tree.

We've added 11 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 78 insertions(+), 24 deletions(-).

The main changes are:

1) three fixes in BPF task iterator logic, from Yonghong.

2) fix for compressed dwarf sections in vmlinux, from Jiri.

3) fix xdp attach regression, from Andrii.
====================

Signed-off-by: David S. Miller <[email protected]>

Merge tag 'riscv-for-linus-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

- The CLINT driver has been split in two: one to handle the M-mode
   CLINT (memory mapped and used on NOMMU systems) and one to handle the
   S-mode CLINT (via SBI).

- The addition of SiFive's drivers to rv32_defconfig

* tag 'riscv-for-linus-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: Add SiFive drivers to rv32_defconfig
  dt-bindings: timer: Add CLINT bindings
  RISC-V: Remove CLINT related code from timer and arch
  clocksource/drivers: Add CLINT timer driver
  RISC-V: Add mechanism to provide custom IPI operations

Merge tag 'for-linus-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull xen fixes from Juergen Gross:
"One build fix and a minor fix for suppressing a useless warning when
  booting a Xen dom0 via UEFI"

* tag 'for-linus-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  Fix build error when CONFIG_ACPI is not set/enabled:
  efi: avoid error message when booting under Xen

Merge tag 'pm-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"These fix a few issues in the operating performance points (OPP)
  framework.

  Specifics:

   - Fix re-enabling of resources in dev_pm_opp_set_rate() (Rajendra
     Nayak)

   - Fix OPP table reference counting in error paths (Stephen Boyd)"

* tag 'pm-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  opp: Enable resources again if they were disabled earlier
  opp: Put opp table in dev_pm_opp_set_rate() if _set_opp_bw() fails
  opp: Put opp table in dev_pm_opp_set_rate() for empty tables

bpf: Fix two typos in uapi/linux/bpf.h

Also remove trailing whitespaces in bpf_skb_get_tunnel_key example code.

Signed-off-by: Tobias Klauser <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

net: dsa: b53: check for timeout

clang static analysis reports this problem

b53_common.c:1583:13: warning: The left expression of the compound
  assignment is an uninitialized value. The computed value will
  also be garbage
        ent.port &= ~BIT(port);
        ~~~~~~~~ ^

ent is set by a successful call to b53_arl_read().  Unsuccessful
calls are caught by an switch statement handling specific returns.
b32_arl_read() calls b53_arl_op_wait() which fails with the
unhandled -ETIMEDOUT.

So add -ETIMEDOUT to the switch statement.  Because
b53_arl_op_wait() already prints out a message, do not add another
one.

Fixes: 1da6df85c6fb ("net: dsa: b53: Implement ARL add/del/dump operations")
Signed-off-by: Tom Rix <[email protected]>
Acked-by: Florian Fainelli <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

hwmon: (applesmc) check status earlier.

clang static analysis reports this representative problem

applesmc.c:758:10: warning: 1st function call argument is an
  uninitialized value
        left = be16_to_cpu(*(__be16 *)(buffer + 6)) >> 2;
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

buffer is filled by the earlier call

ret = applesmc_read_key(LIGHT_SENSOR_LEFT_KEY, ...

This problem is reported because a goto skips the status check.
Other similar problems use data from applesmc_read_key before checking
the status.  So move the checks to before the use.

Signed-off-by: Tom Rix <[email protected]>
Reviewed-by: Henrik Rydberg <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Guenter Roeck <[email protected]>

hwmon: (nct7904) Correct divide by 0

We hit a kernel panic due to a divide by 0 in nct7904_read_fan() for
the hwmon_fan_min case. Extend the check to hwmon_fan_input case as well
for safety.

[ 1656.545650] divide error: 0000 [#1] SMP PTI
[ 1656.545779] CPU: 12 PID: 18010 Comm: sensors Not tainted 5.4.47 #1
[ 1656.546065] RIP: 0010:nct7904_read+0x1e9/0x510 [nct7904]
...
[ 1656.546549] RAX: 0000000000149970 RBX: ffffbd6b86bcbe08 RCX: 0000000000000000
...
[ 1656.547548] Call Trace:
[ 1656.547665]  hwmon_attr_show+0x32/0xd0 [hwmon]
[ 1656.547783]  dev_attr_show+0x18/0x50
[ 1656.547898]  sysfs_kf_seq_show+0x99/0x120
[ 1656.548013]  seq_read+0xd8/0x3e0
[ 1656.548127]  vfs_read+0x89/0x130
[ 1656.548234]  ksys_read+0x7d/0xb0
[ 1656.548342]  do_syscall_64+0x48/0x110
[ 1656.548451]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: d65a5102a99f5 ("hwmon: (nct7904) Convert to use new hwmon registration API")
Signed-off-by: Jason Baron <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Guenter Roeck <[email protected]>

device property: Fix the secondary firmware node handling in set_primary_fwnode()

When the primary firmware node pointer is removed from a
device (set to NULL) the secondary firmware node pointer,
when it exists, is made the primary node for the device.
However, the secondary firmware node pointer of the original
primary firmware node is never cleared (set to NULL).

To avoid situation where the secondary firmware node pointer
is pointing to a non-existing object, clearing it properly
when the primary node is removed from a device in
set_primary_fwnode().

Fixes: 97badf873ab6 ("device property: Make it possible to use secondary firmware nodes")
Cc: All applicable <[email protected]>
Signed-off-by: Heikki Krogerus <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

ACPI: ioremap: avoid redundant rounding to OS page size

The arm64 implementation of acpi_os_ioremap() was recently updated to
tighten the checks around which parts of memory are permitted to be
mapped by ACPI code, which generally only needs access to memory regions
that are statically described by firmware, and any attempts to access
memory that is in active use by the OS is generally a bug or a hacking
attempt. This tightening is based on the EFI memory map, which describes
all memory in the system.

The AArch64 architecture permits page sizes of 16k and 64k in addition
to the EFI default, which is 4k, which means that the EFI memory map may
describe regions that cannot be mapped seamlessly if the OS page size is
greater than 4k. This is usually not a problem, given that the EFI spec
does not permit memory regions requiring different memory attributes to
share a 64k page frame, and so the usual rounding to page size performed
by ioremap() is sufficient to deal with this. However, this rounding does
complicate our EFI memory map permission check, due to the loss of
information that occurs when several small regions share a single 64k
page frame (where rounding each of them will result in the same 64k
single page region).

However, due to the fact that the region check occurs *before* the call
to ioremap() where the necessary rounding is performed, we can deal
with this issue simply by removing the redundant rounding performed by
acpi_os_map_iomem(), as it appears to be the only place where the
arguments to a call to acpi_os_ioremap() are rounded up. So omit the
rounding in the call, and instead, apply the necessary masking when
assigning the map->virt member.

Fixes: 1583052d111f ("arm64/acpi: disallow AML memory opregions to access kernel memory")
Signed-off-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>

ACPI: SoC: APD: Check return value of acpi_dev_get_property()

`fch_misc_setup()` uses `acpi_dev_get_property()` to read the value of
"is-rv" passed in by BIOS in ACPI tables. However, not all BIOSes
might pass in this property and hence it is important to first check
the return value of `acpi_dev_get_property()` before referencing the
object filled by it.

Signed-off-by: Furquan Shaikh <[email protected]>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <[email protected]>

cpufreq: replace cpu_logical_map() with read_cpuid_mpir()

Commit eaecca9e7710 ("arm64: Fix __cpu_logical_map undefined issue")
fixes the issue with building tegra194 cpufreq driver as module. But
the fix might cause problem while supporting physical CPU hotplug[1].

This patch fixes the original problem by avoiding use of cpu_logical_map().
Instead calling read_cpuid_mpidr() to get MPIDR on target CPU.

[1] https://lore.kernel.org/linux-arm-kernel/20200724131059.GB6521@bogus/

Fixes: df320f89359c ("cpufreq: Add Tegra194 cpufreq driver")
Reviewed-by: Sudeep Holla <[email protected]>
Signed-off-by: Sumit Gupta <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
[ rjw: Subject & changelog edits ]
Signed-off-by: Rafael J. Wysocki <[email protected]>

ARM64: vdso32: Install vdso32 from vdso_install

Add the 32-bit vdso Makefile to the vdso_install rule so that 'make
vdso_install' installs the 32-bit compat vdso when it is compiled.

Fixes: a7f71a2c8903 ("arm64: compat: Add vDSO")
Signed-off-by: Stephen Boyd <[email protected]>
Reviewed-by: Vincenzo Frascino <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Vincenzo Frascino <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
"Improvements to ext4's block allocator performance for very large file
  systems, especially when the file system or files which are highly
  fragmented. There is a new mount option, prefetch_block_bitmaps which
  will pull in the block bitmaps and set up the in-memory buddy bitmaps
  when the file system is initially mounted.

  Beyond that, a lot of bug fixes and cleanups. In particular, a number
  of changes to make ext4 more robust in the face of write errors or
  file system corruptions"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
  ext4: limit the length of per-inode prealloc list
  ext4: reorganize if statement of ext4_mb_release_context()
  ext4: add mb_debug logging when there are lost chunks
  ext4: Fix comment typo "the the".
  jbd2: clean up checksum verification in do_one_pass()
  ext4: change to use fallthrough macro
  ext4: remove unused parameter of ext4_generic_delete_entry function
  mballoc: replace seq_printf with seq_puts
  ext4: optimize the implementation of ext4_mb_good_group()
  ext4: delete invalid comments near ext4_mb_check_limits()
  ext4: fix typos in ext4_mb_regular_allocator() comment
  ext4: fix checking of directory entry validity for inline directories
  fs: prevent BUG_ON in submit_bh_wbc()
  ext4: correctly restore system zone info when remount fails
  ext4: handle add_system_zone() failure in ext4_setup_system_zone()
  ext4: fold ext4_data_block_valid_rcu() into the caller
  ext4: check journal inode extents more carefully
  ext4: don't allow overlapping system zones
  ext4: handle error of ext4_setup_system_zone() on remount
  ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp()
  ...

afs: Fix NULL deref in afs_dynroot_depopulate()

If an error occurs during the construction of an afs superblock, it's
possible that an error occurs after a superblock is created, but before
we've created the root dentry. If the superblock has a dynamic root
(ie. what's normally mounted on /afs), the afs_kill_super() will call
afs_dynroot_depopulate() to unpin any created dentries - but this will
oops if the root hasn't been created yet.

Fix this by skipping that bit of code if there is no root dentry.

This leads to an oops looking like:

general protection fault, ...
KASAN: null-ptr-deref in range [0x0000000000000068-0x000000000000006f]
...
RIP: 0010:afs_dynroot_depopulate+0x25f/0x529 fs/afs/dynroot.c:385
...
Call Trace:
afs_kill_super+0x13b/0x180 fs/afs/super.c:535
deactivate_locked_super+0x94/0x160 fs/super.c:335
afs_get_tree+0x1124/0x1460 fs/afs/super.c:598
vfs_get_tree+0x89/0x2f0 fs/super.c:1547
do_new_mount fs/namespace.c:2875 [inline]
path_mount+0x1387/0x2070 fs/namespace.c:3192
do_mount fs/namespace.c:3205 [inline]
__do_sys_mount fs/namespace.c:3413 [inline]
__se_sys_mount fs/namespace.c:3390 [inline]
__x64_sys_mount+0x27f/0x300 fs/namespace.c:3390
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9

which is oopsing on this line:

inode_lock(root->d_inode);

presumably because sb->s_root was NULL.

Fixes: 0da0b7fd73e4 ("afs: Display manually added cells in dynamic root mount")
Reported-by: [email protected]
Signed-off-by: David Howells <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
"One regression from 5.8 and a few bugs from earlier kernels:

   - Various spelling corrections in kernel prints

   - Bug fixes in hfi1 and bntx_re

   - Revert a 5.8 patch in hns

   - Batch update for Mellanox and Cumulus maintainers emails"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  MAINTAINERS: Update Mellanox and Cumulus Network addresses to new domain
  Revert "RDMA/hns: Reserve one sge in order to avoid local length error"
  RDMA/hfi1: Correct an interlock issue for TID RDMA WRITE request
  RDMA/bnxt_re: Do not add user qps to flushlist
  RDMA/core: Fix spelling mistake "Could't" -> "Couldn't"
  RDMA/usnic: Fix spelling mistake "transistion" -> "transition"
  RDMA/hns: Fix spelling mistake "epmty" -> "empty"

Merge tag 'sound-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A collection of small fixes over several drivers, but all are driver-
  specific and nothing looks scary.

  Slightly large changes are seen in ASoC qcom driver for the bugs that
  were revealed by the recent ASoC core change to report the invalid
  register access errors. Also ASoC fsl got a slight intensive change
  for the distortion fix.

  Others are only trivial fixes or device-specific quirks"

* tag 'sound-5.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (25 commits)
  ALSA: hda: avoid reset of sdo_limit
  ALSA: hda/realtek: Add quirk for Samsung Galaxy Book Ion
  ALSA: usb-audio: ignore broken processing/extension unit
  ASoC: intel: Fix memleak in sst_media_open
  ASoC: wm8994: Avoid attempts to read unreadable registers
  ASoC: msm8916-wcd-analog: fix register Interrupt offset
  ASoC: wm8994: Prevent access to invalid VU register bits on WM1811
  ALSA: hda/realtek: Add model alc298-samsung-headphone
  ALSA: usb-audio: Update documentation comment for MS2109 quirk
  ALSA: isa: fix spelling mistakes in the comments
  ALSA: usb-audio: Add capture support for Saffire 6 (USB 1.1)
  ALSA: hda/realtek: Add quirk for Samsung Galaxy Flex Book
  ASoC: q6routing: add dummy register read/write function
  ASoC: q6afe-dai: mark all widgets registers as SND_SOC_NOPM
  ASoC: Make soc_component_read() returning an error code again
  ASoC: amd: Replacing component->name with codec_dai->name.
  ASoC: fsl: Fix unused variable warning
  ASoC: tegra: tegra210_i2s: Fix compile warning with CONFIG_PM=n
  ASoC: tegra: tegra210_dmic: Fix compile warning with CONFIG_PM=n
  ASoC: tegra: tegra210_ahub: Fix compile warning with CONFIG_PM=n
  ...

Merge tag 'drm-fixes-2020-08-21' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"Regular fixes pull for rc2. Usual rc2 doesn't seem too busy, mainly
  i915 and amdgpu. I'd expect the usual uptick for rc3.

  amdgpu:
   - Fix allocation size
   - SR-IOV fixes
   - Vega20 SMU feature state caching fix
   - Fix custom pptable handling
   - Arcturus golden settings update
   - Several display fixes
   - Fixes for Navy Flounder
   - Misc display fixes
   - RAS fix

  amdkfd:
   - SDMA fix for renoir

  i915:
   - Fix device parameter usage for selftest mock i915 device
   - Fix LPSP capability debugfs NULL dereference
   - Fix buddy register pagemask table
   - Fix intel_atomic_check() non-negative return value
   - Fix selftests passing a random 0 into ilog2()
   - Fix TGL power well enable/disable ordering
   - Switch to PMU module refcounting
   - GVT fixes

  virtio:
   - Add missing dma_fence_put() in virtio_gpu_execbuffer_ioctl()
   - Fix memory leak in virtio_gpu_cleanup_object()"

* tag 'drm-fixes-2020-08-21' of git://anongit.freedesktop.org/drm/drm: (34 commits)
  Revert "drm/amdgpu: disable gfxoff for navy_flounder"
  drm/i915/tgl: Make sure TC-cold is blocked before enabling TC AUX power wells
  drm/i915/selftests: Avoid passing a random 0 into ilog2
  drm/i915: Fix wrong return value in intel_atomic_check()
  drm/i915: Update bw_buddy pagemask table
  drm/i915/display: Check for an LPSP encoder before dereferencing
  drm/i915: Copy default modparams to mock i915_device
  drm/i915: Provide the perf pmu.module
  drm/amd/display: fix pow() crashing when given base 0
  drm/amd/display: Reset scrambling on Test Pattern
  drm/amd/display: fix dcn3 wide timing dsc validation
  drm/amd/display: Fix DFPstate hang due to view port changed
  drm/amd/display: Assign correct left shift
  drm/amd/display: Call DMUB for eDP power control
  drm/amdkfd: fix the wrong sdma instance query for renoir
  drm/amdgpu: parse ta firmware for navy_flounder
  drm/amdgpu: fix NULL pointer access issue when unloading driver
  drm/amdgpu: fix uninit-value in arcturus_log_thermal_throttling_event()
  drm/amdgpu: disable gfxoff for navy_flounder
  drm/amdgpu/display: use GFP_ATOMIC in dcn20_validate_bandwidth_internal
  ...

mm, page_alloc: fix core hung in free_pcppages_bulk()

The following race is observed with the repeated online, offline and a
delay between two successive online of memory blocks of movable zone.

P1 P2

Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values,i.e., pcp->high = 0 &
pcp->batch = 1.

Allocate the pages from the
movable zone.

Try to Online the second memory
block in the movable zone thus it
entered the online_pages() but yet
to call zone_pcp_update().
This process is entered into
the exit path thus it tries
to release the order-0 pages
to pcp lists through
free_unref_page_commit().
As pcp->high = 0, pcp->count = 1
proceed to call the function
free_pcppages_bulk().
Update the pcp values thus the
new pcp values are like, say,
pcp->high = 378, pcp->batch = 63.
Read the pcp's batch value using
READ_ONCE() and pass the same to
free_pcppages_bulk(), pcp values
passed here are, batch = 63,
count = 1.

Since num of pages in the pcp
lists are less than ->batch,
then it will stuck in
while(list_empty(list)) loop
with interrupts disabled thus
a core hung.

Avoid this by ensuring free_pcppages_bulk() is called with proper count of
pcp list pages.

The mentioned race is some what easily reproducible without [1] because
pcp's are not updated for the first memory block online and thus there is
a enough race window for P2 between alloc+free and pcp struct values
update through onlining of second memory block.

With [1], the race still exists but it is very narrow as we update the pcp
struct values for the first memory block online itself.

This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory, or
no other memory yet).

[1]: https://patchwork.kernel.org/patch/11696389/

Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type")
Signed-off-by: Charan Teja Reddy <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Vinayak Menon <[email protected]>
Cc: <[email protected]> [2.6+]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm: include CMA pages in lowmem_reserve at boot

The lowmem_reserve arrays provide a means of applying pressure against
allocations from lower zones that were targeted at higher zones.  Its
values are a function of the number of pages managed by higher zones and
are assigned by a call to the setup_per_zone_lowmem_reserve() function.

The function is initially called at boot time by the function
init_per_zone_wmark_min() and may be called later by accesses of the
/proc/sys/vm/lowmem_reserve_ratio sysctl file.

The function init_per_zone_wmark_min() was moved up from a module_init to
a core_initcall to resolve a sequencing issue with khugepaged.
Unfortunately this created a sequencing issue with CMA page accounting.

The CMA pages are added to the managed page count of a zone when
cma_init_reserved_areas() is called at boot also as a core_initcall.  This
makes it uncertain whether the CMA pages will be added to the managed page
counts of their zones before or after the call to
init_per_zone_wmark_min() as it becomes dependent on link order.  With the
current link order the pages are added to the managed count after the
lowmem_reserve arrays are initialized at boot.

This means the lowmem_reserve values at boot may be lower than the values
used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
ratio values are unchanged.

In many cases the difference is not significant, but for example
an ARM platform with 1GB of memory and the following memory layout

  cma: Reserved 256 MiB at 0x0000000030000000
  Zone ranges:
    DMA      [mem 0x0000000000000000-0x000000002fffffff]
    Normal   empty
    HighMem  [mem 0x0000000030000000-0x000000003fffffff]

would result in 0 lowmem_reserve for the DMA zone.  This would allow
userspace to deplete the DMA zone easily.

Funnily enough

  $ cat /proc/sys/vm/lowmem_reserve_ratio

would fix up the situation because as a side effect it forces
setup_per_zone_lowmem_reserve.

This commit breaks the link order dependency by invoking
init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
have the chance to be properly accounted in their zone(s) and allowing
the lowmem_reserve arrays to receive consistent values.

Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
Signed-off-by: Doug Berger <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Jason Baron <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

squashfs: avoid bio_alloc() failure with 1Mbyte blocks

This is a regression introduced by the patch "migrate from ll_rw_block
usage to BIO".

Bio_alloc() is limited to 256 pages (1 Mbyte). This can cause a failure
when reading 1 Mbyte block filesystems. The problem is a datablock can be
fully (or almost uncompressed), requiring 256 pages, but, because blocks
are not aligned to page boundaries, it may require 257 pages to read.

Bio_kmalloc() can handle 1024 pages, and so use this for the edge
condition.

Fixes: 93e72b3c612a ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: Nicolas Prochazka <[email protected]>
Reported-by: Tomoatsu Shimada <[email protected]>
Signed-off-by: Phillip Lougher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Guenter Roeck <[email protected]>
Cc: Philippe Liard <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Adrien Schildknecht <[email protected]>
Cc: Daniel Rosenberg <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

uprobes: __replace_page() avoid BUG in munlock_vma_page()

syzbot crashed on the VM_BUG_ON_PAGE(PageTail) in munlock_vma_page(), when
called from uprobes __replace_page(). Which of many ways to fix it?
Settled on not calling when PageCompound (since Head and Tail are equals
in this context, PageCompound the usual check in uprobes.c, and the prior
use of FOLL_SPLIT_PMD will have cleared PageMlocked already).

Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT")
Reported-by: syzbot <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Srikar Dronamraju <[email protected]>
Acked-by: Song Liu <[email protected]>
Acked-by: Oleg Nesterov <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: <[email protected]> [5.4+]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

kernel/relay.c: fix memleak on destroy relay channel

kmemleak report memory leak as follows:

  unreferenced object 0x607ee4e5f948 (size 8):
  comm "syz-executor.1", pid 2098, jiffies 4295031601 (age 288.468s)
  hex dump (first 8 bytes):
  00 00 00 00 00 00 00 00 ........
  backtrace:
     relay_open kernel/relay.c:583 [inline]
     relay_open+0xb6/0x970 kernel/relay.c:563
     do_blk_trace_setup+0x4a8/0xb20 kernel/trace/blktrace.c:557
     __blk_trace_setup+0xb6/0x150 kernel/trace/blktrace.c:597
     blk_trace_ioctl+0x146/0x280 kernel/trace/blktrace.c:738
     blkdev_ioctl+0xb2/0x6a0 block/ioctl.c:613
     block_ioctl+0xe5/0x120 fs/block_dev.c:1871
     vfs_ioctl fs/ioctl.c:48 [inline]
     __do_sys_ioctl fs/ioctl.c:753 [inline]
     __se_sys_ioctl fs/ioctl.c:739 [inline]
     __x64_sys_ioctl+0x170/0x1ce fs/ioctl.c:739
     do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

'chan->buf' is malloced in relay_open() by alloc_percpu() but not free
while destroy the relay channel.  Fix it by adding free_percpu() before
return from relay_destroy_channel().

Fixes: 017c59c042d0 ("relay: Use per CPU constructs for the relay channel buffer pointers")
Reported-by: Hulk Robot <[email protected]>
Signed-off-by: Wei Yongjun <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Chris Wilson <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Daniel Axtens <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Akash Goel <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

romfs: fix uninitialized memory leak in romfs_dev_read()

romfs has a superblock field that limits the size of the filesystem; data
beyond that limit is never accessed.

romfs_dev_read() fetches a caller-supplied number of bytes from the
backing device. It returns 0 on success or an error code on failure;
therefore, its API can't represent short reads, it's all-or-nothing.

However, when romfs_dev_read() detects that the requested operation would
cross the filesystem size limit, it currently silently truncates the
requested number of bytes. This e.g. means that when the content of a
file with size 0x1000 starts one byte before the filesystem size limit,
->readpage() will only fill a single byte of the supplied page while
leaving the rest uninitialized, leaking that uninitialized memory to
userspace.

Fix it by returning an error code instead of truncating the read when the
requested read operation would go beyond the end of the filesystem.

Fixes: da4458bda237 ("NOMMU: Make it possible for RomFS to use MTD devices directly")
Signed-off-by: Jann Horn <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Cc: David Howells <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm/rodata_test.c: fix missing function declaration

The compilation with CONFIG_DEBUG_RODATA_TEST set produces the following
warning due to the missing include.

mm/rodata_test.c:15:6: warning: no previous prototype for 'rodata_test' [-Wmissing-prototypes]
15 | void rodata_test(void)
| ^~~~~~~~~~~

Fixes: 2959a5f726f6 ("mm: add arch-independent testcases for RODATA")
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mm/vunmap: add cond_resched() in vunmap_pmd_range

Like zap_pte_range add cond_resched so that we can avoid softlockups as
reported below.  On non-preemptible kernel with large I/O map region (like
the one we get when using persistent memory with sector mode), an unmap of
the namespace can report below softlockups.

22724.027334] watchdog: BUG: soft lockup - CPU#49 stuck for 23s! [ndctl:50777]
NIP [c0000000000dc224] plpar_hcall+0x38/0x58
LR [c0000000000d8898] pSeries_lpar_hpte_invalidate+0x68/0xb0
Call Trace:
    flush_hash_page+0x114/0x200
    hpte_need_flush+0x2dc/0x540
    vunmap_page_range+0x538/0x6f0
    free_unmap_vmap_area+0x30/0x70
    remove_vm_area+0xfc/0x140
    __vunmap+0x68/0x270
    __iounmap.part.0+0x34/0x60
    memunmap+0x54/0x70
    release_nodes+0x28c/0x300
    device_release_driver_internal+0x16c/0x280
    unbind_store+0x124/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xd8/0x260
    ksys_write+0xdc/0x130
    system_call+0x5c/0x70

Reported-by: Harish Sriram <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
Cc: <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

khugepaged: adjust VM_BUG_ON_MM() in __khugepaged_enter()

syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
__khugepaged_enter(): yes, when one thread is about to dump core, has set
core_state, and is waiting for others, another might do something calling
__khugepaged_enter(), which now crashes because I lumped the core_state
test (known as "mmget_still_valid") into khugepaged_test_exit(). I still
think it's best to lump them together, so just in this exceptional case,
check mm->mm_users directly instead of khugepaged_test_exit().

Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
Reported-by: syzbot <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Yang Shi <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: <[email protected]> [4.8+]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

hugetlb_cgroup: convert comma to semicolon

Replace a comma between expression statements by a semicolon.

Fixes: faced7e0806cf4 ("mm: hugetlb controller for cgroups v2")
Signed-off-by: Xu Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Giuseppe Scrivano <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

mailmap: add Andi Kleen

I keep getting bounce back from the suse.de address.

Signed-off-by: Nick Desaulniers <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Quentin Perret <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Linus Torvalds <[email protected]>

core/entry: Respect syscall number rewrites

The transcript of the x86 entry code to the generic version failed to
reload the syscall number from ptregs after ptrace and seccomp have run,
which both can modify the syscall number in ptregs. It returns the original
syscall number instead which is obviously not the right thing to do.

Reload the syscall number to fix that.

Fixes: 142781e108b1 ("entry: Provide generic syscall entry functionality")
Reported-by: Kyle Huey <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Kyle Huey <[email protected]>
Tested-by: Kees Cook <[email protected]>
Acked-by: Kees Cook <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

x86/entry/64: Do not use RDPID in paranoid entry to accomodate KVM

KVM has an optmization to avoid expensive MRS read/writes on
VMENTER/EXIT. It caches the MSR values and restores them either when
leaving the run loop, on preemption or when going out to user space.

The affected MSRs are not required for kernel context operations. This
changed with the recently introduced mechanism to handle FSGSBASE in the
paranoid entry code which has to retrieve the kernel GSBASE value by
accessing per CPU memory. The mechanism needs to retrieve the CPU number
and uses either LSL or RDPID if the processor supports it.

Unfortunately RDPID uses MSR_TSC_AUX which is in the list of cached and
lazily restored MSRs, which means between the point where the guest value
is written and the point of restore, MSR_TSC_AUX contains a random number.

If an NMI or any other exception which uses the paranoid entry path happens
in such a context, then RDPID returns the random guest MSR_TSC_AUX value.

As a consequence this reads from the wrong memory location to retrieve the
kernel GSBASE value. Kernel GS is used to for all regular this_cpu_*()
operations. If the GSBASE in the exception handler points to the per CPU
memory of a different CPU then this has the obvious consequences of data
corruption and crashes.

As the paranoid entry path is the only place which accesses MSR_TSX_AUX
(via RDPID) and the fallback via LSL is not significantly slower, remove
the RDPID alternative from the entry path and always use LSL.

The alternative would be to write MSR_TSC_AUX on every VMENTER and VMEXIT
which would be inflicting massive overhead on that code path.

[ tglx: Rewrote changelog ]

Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Reported-by: Tom Lendacky <[email protected]>
Debugged-by: Tom Lendacky <[email protected]>
Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

MAINTAINERS: Update Mellanox and Cumulus Network addresses to new domain

Mellanox and Cumulus Network were acquired by Nvidia, so change the
maintainers emails to new domain name.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Leon Romanovsky <[email protected]>
Acked-by: Andy Shevchenko <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>

powerpc/perf/hv-24x7: Move cpumask file to top folder of hv-24x7 driver

Commit 792f73f747b8 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7
device to show cpumask") added cpumask file as part of hv-24x7 driver
inside the interface folder. The cpumask file is supposed to be in the
top folder of the PMU driver in order to make hotplug work.

This patch fixes that issue and creates new group 'cpumask_attr_group'
to add cpumask file and make sure it added in top folder.

command:# cat /sys/devices/hv_24x7/cpumask
0

Fixes: 792f73f747b8 ("powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask")
Signed-off-by: Kajol Jain <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

powerpc/32s: Fix module loading failure when VMALLOC_END is over 0xf0000000

In is_module_segment(), when VMALLOC_END is over 0xf0000000,
ALIGN(VMALLOC_END, SZ_256M) has value 0.

In that case, addr >= ALIGN(VMALLOC_END, SZ_256M) is always
true then is_module_segment() always returns false.

Use (ALIGN(VMALLOC_END, SZ_256M) - 1) which will have
value 0xffffffff and will be suitable for the comparison.

Fixes: c49643319715 ("powerpc/32s: Only leave NX unset on segments used for modules")
Reported-by: Andreas Schwab <[email protected]>
Signed-off-by: Christophe Leroy <[email protected]>
Tested-by: Andreas Schwab <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/09fc73fe9c7423c6b4cf93f93df9bb0ed8eefab5.1597994047.git.christophe.leroy@csgroup.eu

KVM: arm64: Print warning when cpu erratum can cause guests to deadlock

If guests don't have certain CPU erratum workarounds implemented, then
there is a possibility a guest can deadlock the system. IOW, only trusted
guests should be used on systems with the erratum.

This is the case for Cortex-A57 erratum 832075.

Signed-off-by: Rob Herring <[email protected]>
Acked-by: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: James Morse <[email protected]>
Cc: Julien Thierry <[email protected]>
Cc: Suzuki K Poulose <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>

arm64: Allow booting of late CPUs affected by erratum 1418040

As we can now switch from a system that isn't affected by 1418040
to a system that globally is affected, let's allow affected CPUs
to come in at a later time.

Signed-off-by: Marc Zyngier <[email protected]>
Tested-by: Sai Prakash Ranjan <[email protected]>
Reviewed-by: Stephen Boyd <[email protected]>
Reviewed-by: Suzuki K Poulose <[email protected]>
Acked-by: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>