Petr Mladek [Wed, 15 Jun 2022 16:28:04 +0000 (18:28 +0200)]
printk: Block console kthreads when direct printing will be required
There are known situations when the console kthreads are not
reliable or does not work in principle, for example, early boot,
panic, shutdown.
For these situations there is the direct (legacy) mode when printk() tries
to get console_lock() and flush the messages directly. It works very well
during the early boot when the console kthreads are not available at all.
It gets more complicated in the other situations when console kthreads
might be actively printing and block console_trylock() in printk().
The same problem is in the legacy code as well. Any console_lock()
owner could block console_trylock() in printk(). It is solved by
a trick that the current console_lock() owner is responsible for
printing all pending messages. It is actually the reason why there
is the risk of softlockups and why the console kthreads were
introduced.
The console kthreads use the same approach. They are responsible
for printing the messages by definition. So that they handle
the messages anytime when they are awake and see new ones.
The global console_lock is available when there is nothing
to do.
It should work well when the problematic context is correctly
detected and printk() switches to the direct mode. But it seems
that it is not enough in practice. There are reports that
the messages are not printed during panic() or shutdown()
even though printk() tries to use the direct mode here.
The problem seems to be that console kthreads become active in these
situation as well. They steel the job before other CPUs are stopped.
Then they are stopped in the middle of the job and block the global
console_lock.
First part of the solution is to block console kthreads when
the system is in a problematic state and requires the direct
printk() mode.
Linus Torvalds [Wed, 15 Jun 2022 19:34:19 +0000 (12:34 -0700)]
Merge tag 'tpmdd-next-v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen:
"Two fixes for this merge window"
* tag 'tpmdd-next-v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
certs: fix and refactor CONFIG_SYSTEM_BLACKLIST_HASH_LIST build
certs/blacklist_hashes.c: fix const confusion in certs blacklist
Dave Wysochanski [Fri, 10 Jun 2022 00:46:29 +0000 (20:46 -0400)]
NFSv4: Add FMODE_CAN_ODIRECT after successful open of a NFS4.x file
Commit a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag")
added the FMODE_CAN_ODIRECT flag for NFSv3 but neglected to add
it for NFSv4.x. This causes direct io on NFSv4.x to fail open
with EINVAL:
mount -o vers=4.2 127.0.0.1:/export /mnt/nfs4
dd if=/dev/zero of=/mnt/nfs4/file.bin bs=128k count=1 oflag=direct
dd: failed to open '/mnt/nfs4/file.bin': Invalid argument
dd of=/dev/null if=/mnt/nfs4/file.bin bs=128k count=1 iflag=direct
dd: failed to open '/mnt/dir1/file1.bin': Invalid argument
Fixes: a2ad63daa88b ("VFS: add FMODE_CAN_ODIRECT file flag") Signed-off-by: Dave Wysochanski <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
Tianyu Lan [Tue, 14 Jun 2022 01:45:53 +0000 (21:45 -0400)]
x86/Hyper-V: Add SEV negotiate protocol support in Isolation VM
Hyper-V Isolation VM current code uses sev_es_ghcb_hv_call()
to read/write MSR via GHCB page and depends on the sev code.
This may cause regression when sev code changes interface
design.
The latest SEV-ES code requires to negotiate GHCB version before
reading/writing MSR via GHCB page and sev_es_ghcb_hv_call() doesn't
work for Hyper-V Isolation VM. Add Hyper-V ghcb related implementation
to decouple SEV and Hyper-V code. Negotiate GHCB version in the
hyperv_init() and use the version to communicate with Hyper-V
in the ghcb hv call function.
After successful #VE handling, tdx_handle_virt_exception() has to move
RIP to the next instruction. The handler needs to know the length of the
instruction.
If the #VE happened due to instruction execution, the GET_VEINFO TDX
module call provides info on the instruction in R10, including its length.
For #VE due to EPT violation, the info in R10 is not populand and the
kernel must decode the instruction manually to find out its length.
Restructure the code to make it explicit that the instruction length
depends on the type of #VE. Make individual #VE handlers return
the instruction length on success or -errno on failure.
Jens Axboe [Wed, 15 Jun 2022 17:56:07 +0000 (11:56 -0600)]
Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.19
Pull MD fixes from Song.
* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
md/raid5-ppl: Fix argument order in bio_alloc_bioset()
Revert "md: don't unregister sync_thread with reconfig_mutex held"
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:46 +0000 (10:27 -0600)]
md/raid5-ppl: Fix argument order in bio_alloc_bioset()
bio_alloc_bioset() takes a block device, number of vectors, the
OP flags, the GFP mask and the bio set. However when the prototype
was changed, the callisite in ppl_do_flush() had the OP flags and
the GFP flags reversed. This introduced some sparse error:
drivers/md/raid5-ppl.c:632:57: warning: incorrect type in argument 3
(different base types)
drivers/md/raid5-ppl.c:632:57: expected unsigned int opf
drivers/md/raid5-ppl.c:632:57: got restricted gfp_t [usertype]
drivers/md/raid5-ppl.c:633:61: warning: incorrect type in argument 4
(different base types)
drivers/md/raid5-ppl.c:633:61: expected restricted gfp_t [usertype]
gfp_mask
drivers/md/raid5-ppl.c:633:61: got unsigned long long
The sparse error introduction may not have been reported correctly by
0day due to other work that was cleaning up other sparse errors in this
area.
bpf: Limit maximum modifier chain length in btf_check_type_tags
On processing a module BTF of module built for an older kernel, we might
sometimes find that some type points to itself forming a loop. If such a
type is a modifier, btf_check_type_tags's while loop following modifier
chain will be caught in an infinite loop.
Fix this by defining a maximum chain length and bailing out if we spin
any longer than that.
then mddev->reshape_position is set to MaxSector in raid5_finish_reshape
since MD_RECOVERY_INTR is cleared in md_check_recovery, which means
feature_map is not set with MD_FEATURE_RESHAPE_ACTIVE and superblock's
reshape_position can't be updated accordingly.
Fixes: 8b48ec23cc51a ("md: don't unregister sync_thread with reconfig_mutex held") Reported-by: Logan Gunthorpe <[email protected]> Signed-off-by: Guoqing Jiang <[email protected]> Signed-off-by: Song Liu <[email protected]>
Mengqi Zhang [Thu, 9 Jun 2022 11:22:39 +0000 (19:22 +0800)]
mmc: mediatek: wait dma stop bit reset to 0
MediaTek IP requires that after dma stop, it need to wait this dma stop
bit auto-reset to 0. When bus is in high loading state, it will take a
while for the dma stop complete. If there is no waiting operation here,
when program runs to clear fifo and reset, bus will hang.
In addition, there should be no return in msdc_data_xfer_next() if
there is data need be transferred, because no matter what error occurs
here, it should continue to excute to the following mmc_request_done.
Otherwise the core layer may wait complete forever.
Linus Torvalds [Wed, 15 Jun 2022 16:04:55 +0000 (09:04 -0700)]
Merge tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull vfs idmapping fix from Christian Brauner:
"This fixes an issue where we fail to change the group of a file when
the caller owns the file and is a member of the group to change to.
This is only relevant on idmapped mounts.
There's a detailed description in the commit message and regression
tests have been added to xfstests"
* tag 'fs.fixes.v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
fs: account for group membership
After commit 82f6cdcc3676c ("dm: switch dm_io booleans over to proper
flags") dm_start_io_acct stopped atomically checking and setting
was_accounted, which turned into the DM_IO_ACCOUNTED flag. This opened
the possibility for a race where IO accounting is started twice for
duplicate bios. To remove the race, check the flag while holding the
io->lock.
Jens Axboe [Wed, 15 Jun 2022 15:39:05 +0000 (09:39 -0600)]
Merge tag 'nvme-5.19-2022-06-15' of git://git.infradead.org/nvme into block-5.19
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 5.19
- quirks, quirks, quirks to work around buggy consumer grade devices
(Keith Bush, Ning Wang, Stefan Reiter, Rasheed Hsueh)
- better kernel messages for devices that need quirking (Keith Bush)
- make a kernel message more useful (Thomas Weißschuh)"
* tag 'nvme-5.19-2022-06-15' of git://git.infradead.org/nvme:
nvme-pci: disable write zeros support on UMIC and Samsung SSDs
nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs
nvme-pci: sk hynix p31 has bogus namespace ids
nvme-pci: smi has bogus namespace ids
nvme-pci: phison e12 has bogus namespace ids
nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S50
nvme-pci: add trouble shooting steps for timeouts
nvme: add bug report info for global duplicate id
nvme: add device name to warning in uuid_show()
Mark Rutland [Tue, 14 Jun 2022 08:09:43 +0000 (09:09 +0100)]
arm64: ftrace: consistently handle PLTs.
Sometimes it is necessary to use a PLT entry to call an ftrace
trampoline. This is handled by ftrace_make_call() and ftrace_make_nop(),
with each having *almost* identical logic, but this is not handled by
ftrace_modify_call() since its introduction in commit:
Due to this, if we ever were to call ftrace_modify_call() for a callsite
which requires a PLT entry for a trampoline, then either:
a) If the old addr requires a trampoline, ftrace_modify_call() will use
an out-of-range address to generate the 'old' branch instruction.
This will result in warnings from aarch64_insn_gen_branch_imm() and
ftrace_modify_code(), and no instructions will be modified. As
ftrace_modify_call() will return an error, this will result in
subsequent internal ftrace errors.
b) If the old addr does not require a trampoline, but the new addr does,
ftrace_modify_call() will use an out-of-range address to generate the
'new' branch instruction. This will result in warnings from
aarch64_insn_gen_branch_imm(), and ftrace_modify_code() will replace
the 'old' branch with a BRK. This will result in a kernel panic when
this BRK is later executed.
Practically speaking, case (a) is vastly more likely than case (b), and
typically this will result in internal ftrace errors that don't
necessarily affect the rest of the system. This can be demonstrated with
an out-of-tree test module which triggers ftrace_modify_call(), e.g.
We can solve this by consistently determining whether to use a PLT entry
for an address.
Note that since (the earlier) commit:
f1a54ae9af0da4d7 ("arm64: module/ftrace: intialize PLT at load time")
... we can consistently determine the PLT address that a given callsite
will use, and therefore ftrace_make_nop() does not need to skip
validation when a PLT is in use.
This patch factors the existing logic out of ftrace_make_call() and
ftrace_make_nop() into a common ftrace_find_callable_addr() helper
function, which is used by ftrace_make_call(), ftrace_make_nop(), and
ftrace_modify_call(). In ftrace_make_nop() the patching is consistently
validated by ftrace_modify_code() as we can always determine what the
old instruction should have been.
Mark Rutland [Tue, 14 Jun 2022 08:09:42 +0000 (09:09 +0100)]
arm64: ftrace: fix branch range checks
The branch range checks in ftrace_make_call() and ftrace_make_nop() are
incorrect, erroneously permitting a forwards branch of 128M and
erroneously rejecting a backwards branch of 128M.
This is because both functions calculate the offset backwards,
calculating the offset *from* the target *to* the branch, rather than
the other way around as the later comparisons expect.
If an out-of-range branch were erroeously permitted, this would later be
rejected by aarch64_insn_gen_branch_imm() as branch_imm_common() checks
the bounds correctly, resulting in warnings and the placement of a BRK
instruction. Note that this can only happen for a forwards branch of
exactly 128M, and so the caller would need to be exactly 128M bytes
below the relevant ftrace trampoline.
If an in-range branch were erroeously rejected, then:
* For modules when CONFIG_ARM64_MODULE_PLTS=y, this would result in the
use of a PLT entry, which is benign.
Note that this is the common case, as this is selected by
CONFIG_RANDOMIZE_BASE (and therefore RANDOMIZE_MODULE_REGION_FULL),
which distributions typically seelct. This is also selected by
CONFIG_ARM64_ERRATUM_843419.
* For modules when CONFIG_ARM64_MODULE_PLTS=n, this would result in
internal ftrace failures.
* For core kernel text, this would result in internal ftrace failues.
Note that for this to happen, the kernel text would need to be at
least 128M bytes in size, and typical configurations are smaller tha
this.
Fix this by calculating the offset *from* the branch *to* the target in
both functions.
The reverted patch was needed as a fix after commit f5bda35fba61
("random: use static branch for crng_ready()"). However, this was
already fixed by 60e5b2886b92 ("random: do not use jump labels before
they are initialized") and hence no longer necessary to initialise jump
labels before setup_machine_fdt().
Jon Maxwell [Wed, 15 Jun 2022 01:15:40 +0000 (11:15 +1000)]
bpf: Fix request_sock leak in sk lookup helpers
A customer reported a request_socket leak in a Calico cloud environment. We
found that a BPF program was doing a socket lookup with takes a refcnt on
the socket and that it was finding the request_socket but returning the parent
LISTEN socket via sk_to_full_sk() without decrementing the child request socket
1st, resulting in request_sock slab object leak. This patch retains the
existing behaviour of returning full socks to the caller but it also decrements
the child request_socket if one is present before doing so to prevent the leak.
Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
thanks to Antoine Tenart for the reproducer and patch input.
v2 of this patch contains, refactor as per Daniel Borkmann's suggestions to
validate RCU flags on the listen socket so that it balances with bpf_sk_release()
and update comments as per Martin KaFai Lau's suggestion. One small change to
Daniels suggestion, put "sk = sk2" under "if (sk2 != sk)" to avoid an extra
instruction.
Duoming Zhou [Tue, 14 Jun 2022 09:25:57 +0000 (17:25 +0800)]
net: ax25: Fix deadlock caused by skb_recv_datagram in ax25_recvmsg
The skb_recv_datagram() in ax25_recvmsg() will hold lock_sock
and block until it receives a packet from the remote. If the client
doesn`t connect to server and calls read() directly, it will not
receive any packets forever. As a result, the deadlock will happen.
This patch replaces skb_recv_datagram() with an open-coded variant of it
releasing the socket lock before the __skb_wait_for_more_packets() call
and re-acquiring it after such call in order that other functions that
need socket lock could be executed.
what's more, the socket lock will be released only when recvmsg() will
block and that should produce nicer overall behavior.
Pavel Begunkov [Wed, 15 Jun 2022 10:23:05 +0000 (11:23 +0100)]
io_uring: fix ->extra{1,2} misuse
We don't really know the state of req->extra{1,2] fields in
__io_fill_cqe_req(), if an opcode handler is not aware of CQE32 option,
it never sets them up properly. Track the state of those fields with a
request flag.
Pavel Begunkov [Wed, 15 Jun 2022 10:23:04 +0000 (11:23 +0100)]
io_uring: fill extra big cqe fields from req
The only user of io_req_complete32()-like functions is cmd
requests. Instead of keeping the whole complete32 family, remove them
and provide the extras in already added for inline completions
req->extra{1,2}. When fill_cqe_res() finds CQE32 option enabled
it'll use those fields to fill a 32B cqe.
Pavel Begunkov [Wed, 15 Jun 2022 10:23:03 +0000 (11:23 +0100)]
io_uring: unite fill_cqe and the 32B version
We want just one function that will handle both normal cqes and 32B
cqes. Combine __io_fill_cqe_req() and __io_fill_cqe_req32(). It's still
not entirely correct yet, but saves us from cases when we fill an CQE of
a wrong size.
Pavel Begunkov [Wed, 15 Jun 2022 10:23:02 +0000 (11:23 +0100)]
io_uring: get rid of __io_fill_cqe{32}_req()
There are too many cqe filling helpers, kill __io_fill_cqe{32}_req(),
use __io_fill_cqe{32}_req_filled() instead, and then rename it. It'll
simplify fixing in following patches.
Problems observed:
======================================================================
1) Using ssh/sshfs. The remote sshd daemon can abort with the message:
"message authentication code incorrect"
This happens because the tcp message sent is corrupted during the
USB "Bulk out". The device calculate the tcp checksum and send a
valid tcp message to the remote sshd. Then the encryption detects
the error and aborts.
2) NETDEV WATCHDOG: ... (ax88179_178a): transmit queue 0 timed out
3) Stop normal work without any log message.
The "Bulk in" continue receiving packets normally.
The host sends "Bulk out" and the device responds with -ECONNRESET.
(The netusb.c code tx_complete ignore -ECONNRESET)
Under normal conditions these errors take days to happen and in
intense usage take hours.
A test with ping gives packet loss, showing that something is wrong:
ping -4 -s 462 {destination} # 462 = 512 - 42 - 8
Not all packets fail.
My guess is that the device tries to find another packet starting
at the extra byte and will fail or not depending on the next
bytes (old buffer content).
======================================================================
David S. Miller [Wed, 15 Jun 2022 08:15:33 +0000 (09:15 +0100)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-06-14
This series contains updates to ice driver only.
Michal fixes incorrect Tx timestamp offset calculation for E822 devices.
Roman enforces required VLAN filtering settings for double VLAN mode.
Przemyslaw fixes memory corruption issues with VFs by ensuring
queues are disabled in the error path of VF queue configuration and to
disabled VFs during reset.
====================
Takashi Iwai [Tue, 14 Jun 2022 05:48:31 +0000 (07:48 +0200)]
ALSA: hda/realtek: Apply fixup for Lenovo Yoga Duet 7 properly
It turned out that Lenovo shipped two completely different products
with the very same PCI SSID, where both require different quirks;
namely, Lenovo C940 has already the fixup for its speaker
(ALC298_FIXUP_LENOVO_SPK_VOLUME) with the PCI SSID 17aa:3818, while
Yoga Duet 7 has also the very same PCI SSID but requires a different
quirk, ALC287_FIXUP_YOGA7_14TIL_SPEAKERS.
Fortunately, both are with different codecs (C940 with ALC298 and Duet
7 with ALC287), hence we can apply different fixes by checking the
codec ID. This patch implements that special fixup function.
For easier handling, the internal function for applying a specific
fixup entry is exported as __snd_hda_apply_fixup(), so that it can be
called from the codec driver. The rest is simply calling it with a
different fixup ID depending on the codec ID.
Tyler Hicks [Thu, 26 May 2022 23:59:59 +0000 (18:59 -0500)]
9p: Fix refcounting during full path walks for fid lookups
Decrement the refcount of the parent dentry's fid after walking
each path component during a full path walk for a lookup. Failure to do
so can lead to fids that are not clunked until the filesystem is
unmounted, as indicated by this warning:
9pnet: found fid 3 not clunked
The improper refcounting after walking resulted in open(2) returning
-EIO on any directories underneath the mount point when using the virtio
transport. When using the fd transport, there's no apparent issue until
the filesytem is unmounted and the warning above is emitted to the logs.
In some cases, the user may not yet be attached to the filesystem and a
new root fid, associated with the user, is created and attached to the
root dentry before the full path walk is performed. Increment the new
root fid's refcount to two in that situation so that it can be safely
decremented to one after it is used for the walk operation. The new fid
will still be attached to the root dentry when
v9fs_fid_lookup_with_uid() returns so a final refcount of one is
correct/expected.
Linus Torvalds [Tue, 14 Jun 2022 17:36:11 +0000 (10:36 -0700)]
netfs: fix up netfs_inode_init() docbook comment
Commit e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode
wrapper introduced") changed the argument types and names, and actually
updated the comment too (although that was thanks to David Howells, not
me: my original patch only changed the code).
But the comment fixup didn't go quite far enough, and didn't change the
argument name in the comment, resulting in
include/linux/netfs.h:314: warning: Function parameter or member 'ctx' not described in 'netfs_inode_init'
include/linux/netfs.h:314: warning: Excess function parameter 'inode' description in 'netfs_inode_init'
during htmldoc generation.
Fixes: e81fb4198e27 ("netfs: Further cleanups after struct netfs_inode wrapper introduced") Reported-by: Stephen Rothwell <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Mark Brown [Tue, 14 Jun 2022 12:10:45 +0000 (13:10 +0100)]
selftests: Fix clang cross compilation
Unlike GCC clang uses a single compiler image to support multiple target
architectures meaning that we can't simply rely on CROSS_COMPILE to select
the output architecture. Instead we must pass --target to the compiler to
tell it what to output, kselftest was not doing this so cross compilation
of kselftest using clang resulted in kselftest being built for the host
architecture.
More work is required to fix tests using custom rules but this gets the
bulk of things building.
Roman Li [Thu, 19 May 2022 18:41:16 +0000 (14:41 -0400)]
drm/amd/display: Cap OLED brightness per max frame-average luminance
[Why]
For OLED eDP the Display Manager uses max_cll value as a limit
for brightness control.
max_cll defines the content light luminance for individual pixel.
Whereas max_fall defines frame-average level luminance.
The user may not observe the difference in brightness in between
max_fall and max_cll.
That negatively impacts the user experience.
[How]
Use max_fall value instead of max_cll as a limit for brightness control.
Even though IORING_CLOSE_FD_AND_FILE_SLOT might save cycles for some
users, but it tries to do two things at a time and it's not clear how to
handle errors and what to return in a single result field when one part
fails and another completes well. Kill it for now.
CQE32 nops were used for debugging and benchmarking but it doesn't
target any real use case. Revert it, we can return it back if someone
finds a good way to use it.
Disable VF's RX/TX queues, when VIRTCHNL_OP_CONFIG_VSI_QUEUES fail.
Not disabling them might lead to scenario, where PF driver leaves VF
queues enabled, when VF's VSI failed queue config.
In this scenario VF should not have RX/TX queues enabled. If PF failed
to set up VF's queues, VF will reset due to TX timeouts in VF driver.
Initialize iterator 'i' to -1, so if error happens prior to configuring
queues then error path code will not disable queue 0. Loop that
configures queues will is using same iterator, so error path code will
only disable queues that were configured.
Fixes: 77ca27c41705 ("ice: add support for virtchnl_queue_select.[tx|rx]_queues bitmap") Suggested-by: Slawomir Laba <[email protected]> Signed-off-by: Przemyslaw Patynowski <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]>
VLAN filtering features, that is C-Tag and S-Tag, in DVM mode must be
both enabled or disabled.
In case of turning off/on only one of the features, another feature must
be turned off/on automatically with issuing an appropriate message to
the kernel log.
Fixes: 1babaf77f49d ("ice: Advertise 802.1ad VLAN filtering and offloads for PF netdev") Signed-off-by: Roman Storozhenko <[email protected]> Co-developed-by: Anatolii Gerasymenko <[email protected]> Signed-off-by: Anatolii Gerasymenko <[email protected]> Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
Michal Michalik [Tue, 10 May 2022 11:03:43 +0000 (13:03 +0200)]
ice: Fix PTP TX timestamp offset calculation
The offset was being incorrectly calculated for E822 - that led to
collisions in choosing TX timestamp register location when more than
one port was trying to use timestamping mechanism.
In E822 one quad is being logically split between ports, so quad 0 is
having trackers for ports 0-3, quad 1 ports 4-7 etc. Each port should
have separate memory location for tracking timestamps. Due to error for
example ports 1 and 2 had been assigned to quad 0 with same offset (0),
while port 1 should have offset 0 and 1 offset 16.
Fix it by correctly calculating quad offset.
Fixes: 3a7496234d17 ("ice: implement basic E822 PTP support") Signed-off-by: Michal Michalik <[email protected]> Tested-by: Gurucharan <[email protected]> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <[email protected]>
Linus Torvalds [Tue, 14 Jun 2022 14:57:18 +0000 (07:57 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"While last week's pull request contained miscellaneous fixes for x86,
this one covers other architectures, selftests changes, and a bigger
series for APIC virtualization bugs that were discovered during 5.20
development. The idea is to base 5.20 development for KVM on top of
this tag.
ARM64:
- Properly reset the SVE/SME flags on vcpu load
- Fix a vgic-v2 regression regarding accessing the pending state of a
HW interrupt from userspace (and make the code common with vgic-v3)
- Fix access to the idreg range for protected guests
- Ignore 'kvm-arm.mode=protected' when using VHE
- Return an error from kvm_arch_init_vm() on allocation failure
- A bunch of small cleanups (comments, annotations, indentation)
RISC-V:
- Typo fix in arch/riscv/kvm/vmid.c
- Remove broken reference pattern from MAINTAINERS entry
x86-64:
- Fix error in page tables with MKTME enabled
- Dirty page tracking performance test extended to running a nested
guest
- Disable APICv/AVIC in cases that it cannot implement correctly"
[ This merge also fixes a misplaced end parenthesis bug introduced in
commit 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC
ID or APIC base") pointed out by Sean Christopherson ]
Link: https://lore.kernel.org/all/[email protected]/
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (34 commits)
KVM: selftests: Restrict test region to 48-bit physical addresses when using nested
KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2
KVM: selftests: Clean up LIBKVM files in Makefile
KVM: selftests: Link selftests directly with lib object files
KVM: selftests: Drop unnecessary rule for STATIC_LIBS
KVM: selftests: Add a helper to check EPT/VPID capabilities
KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h
KVM: selftests: Refactor nested_map() to specify target level
KVM: selftests: Drop stale function parameter comment for nested_map()
KVM: selftests: Add option to create 2M and 1G EPT mappings
KVM: selftests: Replace x86_page_size with PG_LEVEL_XX
KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE
KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put
KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking
KVM: x86: disable preemption while updating apicv inhibition
KVM: x86: SVM: fix avic_kick_target_vcpus_fast
KVM: x86: SVM: remove avic's broken code that updated APIC ID
KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base
KVM: x86: document AVIC/APICv inhibit reasons
KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs
...
Originally the cq reservation was performed first, followed by the skb
allocation. Commit 675716400da6 ("xdp: fix possible cq entry leak")
reversed the order because at the time there was no mechanism available
to undo the cq reservation which could have led to possible cq entry leaks
in the event of skb allocation failure. However if the skb allocation is
performed first and the cq reservation then fails, the xsk skb destructor
is called which blindly adds the skb address to the already full cq leading
to undefined behavior.
This commit restores the original order (cq reservation followed by skb
allocation) and uses the xskq_prod_cancel helper to undo the cq reserve
in event of skb allocation failure.
Linus Torvalds [Tue, 14 Jun 2022 14:43:15 +0000 (07:43 -0700)]
Merge tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MMIO stale data fixes from Thomas Gleixner:
"Yet another hw vulnerability with a software mitigation: Processor
MMIO Stale Data.
They are a class of MMIO-related weaknesses which can expose stale
data by propagating it into core fill buffers. Data which can then be
leaked using the usual speculative execution methods.
Mitigations include this set along with microcode updates and are
similar to MDS and TAA vulnerabilities: VERW now clears those buffers
too"
* tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/mmio: Print SMT warning
KVM: x86/speculation: Disable Fill buffer clear within guests
x86/speculation/mmio: Reuse SRBDS mitigation for SBDS
x86/speculation/srbds: Update SRBDS mitigation selection
x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data
x86/speculation/mmio: Enable CPU Fill buffer clearing on idle
x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations
x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data
x86/speculation: Add a common function for MD_CLEAR mitigation update
x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug
Documentation: Add documentation for Processor MMIO Stale Data
Petr Machata [Mon, 13 Jun 2022 12:50:17 +0000 (15:50 +0300)]
mlxsw: spectrum_cnt: Reorder counter pools
Both RIF and ACL flow counters use a 24-bit SW-managed counter address to
communicate which counter they want to bind.
In a number of Spectrum FW releases, binding a RIF counter is broken and
slices the counter index to 16 bits. As a result, on Spectrum-2 and above,
no more than about 410 RIF counters can be effectively used. This
translates to 205 netdevices for which L3 HW stats can be enabled. (This
does not happen on Spectrum-1, because there are fewer counters available
overall and the counter index never exceeds 16 bits.)
Binding counters to ACLs does not have this issue. Therefore reorder the
counter allocation scheme so that RIF counters come first and therefore get
lower indices that are below the 16-bit barrier.
Marek Szyprowski [Fri, 13 May 2022 08:31:05 +0000 (10:31 +0200)]
drm/exynos: mic: Rework initialization
Commit dd8b6803bc49 ("exynos: drm: dsi: Attach in_bridge in MIC driver")
moved Exynos MIC attaching from DSI to MIC driver. However the method
proposed there is incomplete and cannot really work. To properly attach
it to the bridge chain, access to the respective encoder is needed. The
Exynos MIC driver always attaches to the encoder created by the Exynos
DSI driver, so grab it via available helpers for getting access to the
CRTC and encoders. This also requires to change the order of driver
component binding to let DSI to be bound before MIC.
Fixes: dd8b6803bc49 ("exynos: drm: dsi: Attach in_bridge in MIC driver") Signed-off-by: Marek Szyprowski <[email protected]>
Fixed merge conflict. Signed-off-by: Inki Dae <[email protected]>
When calling setattr_prepare() to determine the validity of the
attributes the ia_{g,u}id fields contain the value that will be written
to inode->i_{g,u}id. This is exactly the same for idmapped and
non-idmapped mounts and allows callers to pass in the values they want
to see written to inode->i_{g,u}id.
When group ownership is changed a caller whose fsuid owns the inode can
change the group of the inode to any group they are a member of. When
searching through the caller's groups we need to use the gid mapped
according to the idmapped mount otherwise we will fail to change
ownership for unprivileged users.
Consider a caller running with fsuid and fsgid 1000 using an idmapped
mount that maps id 65534 to 1000 and 65535 to 1001. Consequently, a file
owned by 65534:65535 in the filesystem will be owned by 1000:1001 in the
idmapped mount.
The caller now requests the gid of the file to be changed to 1000 going
through the idmapped mount. In the vfs we will immediately map the
requested gid to the value that will need to be written to inode->i_gid
and place it in attr->ia_gid. Since this idmapped mount maps 65534 to
1000 we place 65534 in attr->ia_gid.
When we check whether the caller is allowed to change group ownership we
first validate that their fsuid matches the inode's uid. The
inode->i_uid is 65534 which is mapped to uid 1000 in the idmapped mount.
Since the caller's fsuid is 1000 we pass the check.
We now check whether the caller is allowed to change inode->i_gid to the
requested gid by calling in_group_p(). This will compare the passed in
gid to the caller's fsgid and search the caller's additional groups.
Since we're dealing with an idmapped mount we need to pass in the gid
mapped according to the idmapped mount. This is akin to checking whether
a caller is privileged over the future group the inode is owned by. And
that needs to take the idmapped mount into account. Note, all helpers
are nops without idmapped mounts.
The AMD XGbE driver currently counts the number of interrupts assigned
to the device by inspecting the pdev->resource array. Since commit a1a2b7125e10 ("of/platform: Drop static setup of IRQ resource from DT
core") removed IRQs from this array, the driver now attempts to get all
interrupts from 1 to -1U and gives up probing once it reaches an invalid
interrupt index.
Obtain the number of IRQs with platform_irq_count() instead.
Sergey Gorenko [Mon, 13 Jun 2022 12:38:54 +0000 (15:38 +0300)]
scsi: iscsi: Exclude zero from the endpoint ID range
The kernel returns an endpoint ID as r.ep_connect_ret.handle in the
iscsi_uevent. The iscsid validates a received endpoint ID and treats zero
as an error. The commit referenced in the fixes line changed the endpoint
ID range, and zero is always assigned to the first endpoint ID. So, the
first attempt to create a new iSER connection always fails.
rasheed.hsueh [Fri, 10 Jun 2022 06:27:34 +0000 (14:27 +0800)]
nvme-pci: disable write zeros support on UMIC and Samsung SSDs
Like commit 5611ec2b9814 ("nvme-pci: prevent SK hynix PC400 from using
Write Zeroes command"), UMIS and Samsung has the same issue:
[ 6305.633887] blk_update_request: operation not supported error,
dev nvme0n1, sector 340812032 op 0x9:(WRITE_ZEROES) flags 0x0
phys_seg 0 prio class 0
So also disable Write Zeroes command on UMIS and Samsung.
Ning Wang [Sun, 5 Jun 2022 20:36:48 +0000 (20:36 +0000)]
nvme-pci: avoid the deepest sleep state on ZHITAI TiPro7000 SSDs
When ZHITAI TiPro7000 SSDs entered deepest power state(ps4)
it has the same APST sleep problem as Kingston A2000.
by chance the system crashes and displays the same dmesg info:
As the Archlinux wiki suggest (enlat + exlat) < 25000 is fine
and my testing shows no system crashes ever since.
Therefore disabling the deepest power state will fix the APST sleep issue.
Stefan Reiter [Mon, 6 Jun 2022 13:01:29 +0000 (13:01 +0000)]
nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S50
ADATA XPG GAMMIX S50 drives report bogus eui64 values that appear to
be the same across drives in one system. Quirk them out so they are
not marked as "non globally unique" duplicates.
Keith Busch [Mon, 6 Jun 2022 16:53:17 +0000 (09:53 -0700)]
nvme-pci: add trouble shooting steps for timeouts
Many users have encountered IO timeouts with a CSTS value of 0xffffffff,
which indicates a failure to read the register. While there are various
potential causes for this observation, faulty NVMe APST has been the
culprit quite frequently. Add the recommended troubleshooting steps in
the error output when this condition occurs.
Keith Busch [Tue, 7 Jun 2022 15:30:29 +0000 (08:30 -0700)]
nvme: add bug report info for global duplicate id
The recent global id check is finding poorly implemented devices in the
wild. Include relavant device information in the output to help quicken
an appropriate quirk patch.
[ 00.000000] No UUID available providing old NGUID
New message:
[ 00.000000] block nvme0n1: No UUID available providing old NGUID
Fixes: d934f9848a77 ("nvme: provide UUID value to userspace") Signed-off-by: Thomas Weißschuh <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
usercopy: Make usercopy resilient against ridiculously large copies
If 'n' is so large that it's negative, we might wrap around and mistakenly
think that the copy is OK when it's not. Such a copy would probably
crash, but just doing the arithmetic in a more simple way lets us detect
and refuse this case.
vmalloc does not allocate a vm_struct for vm_map_ram() areas. That causes
us to deny usercopies from those areas. This affects XFS which uses
vm_map_ram() for its directories.
Fix this by calling find_vmap_area() instead of find_vm_area().
Kailang Yang [Mon, 13 Jun 2022 06:57:19 +0000 (14:57 +0800)]
ALSA: hda/realtek - ALC897 headset MIC no sound
There is not have Headset Mic verb table in BIOS default.
So, it will have recording issue from headset MIC.
Add the verb table value without jack detect. It will turn on Headset Mic.
Jann Horn [Wed, 8 Jun 2022 18:22:05 +0000 (20:22 +0200)]
mm/slub: add missing TID updates on slab deactivation
The fastpath in slab_alloc_node() assumes that c->slab is stable as long as
the TID stays the same. However, two places in __slab_alloc() currently
don't update the TID when deactivating the CPU slab.
If multiple operations race the right way, this could lead to an object
getting lost; or, in an even more unlikely situation, it could even lead to
an object being freed onto the wrong slab's freelist, messing up the
`inuse` counter and eventually causing a page to be freed to the page
allocator while it still contains slab objects.
(I haven't actually tested these cases though, this is just based on
looking at the code. Writing testcases for this stuff seems like it'd be
a pain...)
The race leading to state inconsistency is (all operations on the same CPU
and kmem_cache):
- task A: begin do_slab_free():
- read TID
- read pcpu freelist (==NULL)
- check `slab == c->slab` (true)
- [PREEMPT A->B]
- task B: begin slab_alloc_node():
- fastpath fails (`c->freelist` is NULL)
- enter __slab_alloc()
- slub_get_cpu_ptr() (disables preemption)
- enter ___slab_alloc()
- take local_lock_irqsave()
- read c->freelist as NULL
- get_freelist() returns NULL
- write `c->slab = NULL`
- drop local_unlock_irqrestore()
- goto new_slab
- slub_percpu_partial() is NULL
- get_partial() returns NULL
- slub_put_cpu_ptr() (enables preemption)
- [PREEMPT B->A]
- task A: finish do_slab_free():
- this_cpu_cmpxchg_double() succeeds()
- [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]
From there, the object on c->freelist will get lost if task B is allowed to
continue from here: It will proceed to the retry_load_slab label,
set c->slab, then jump to load_freelist, which clobbers c->freelist.
But if we instead continue as follows, we get worse corruption:
- task A: run __slab_free() on object from other struct slab:
- CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
- task A: run slab_alloc_node() with NUMA node constraint:
- fastpath fails (c->slab is NULL)
- call __slab_alloc()
- slub_get_cpu_ptr() (disables preemption)
- enter ___slab_alloc()
- c->slab is NULL: goto new_slab
- slub_percpu_partial() is non-NULL
- set c->slab to slub_percpu_partial(c)
- [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects
from slab-2]
- goto redo
- node_match() fails
- goto deactivate_slab
- existing c->freelist is passed into deactivate_slab()
- inuse count of slab-1 is decremented to account for object from
slab-2
At this point, the inuse count of slab-1 is 1 lower than it should be.
This means that if we free all allocated objects in slab-1 except for one,
SLUB will think that slab-1 is completely unused, and may free its page,
leading to use-after-free.
mm/slub: Move the stackdepot related allocation out of IRQ-off section.
The set_track() invocation in free_debug_processing() is invoked with
acquired slab_lock(). The lock disables interrupts on PREEMPT_RT and
this forbids to allocate memory which is done in stack_depot_save().
Split set_track() into two parts: set_track_prepare() which allocate
memory and set_track_update() which only performs the assignment of the
trace data structure. Use set_track_prepare() before disabling
interrupts.
[ [email protected]: make set_track() call set_track_update() instead of
open-coded assignments ]
Serge Semin [Fri, 10 Jun 2022 07:42:33 +0000 (10:42 +0300)]
i2c: designware: Use standard optional ref clock implementation
Even though the DW I2C controller reference clock source is requested by
the method devm_clk_get() with non-optional clock requirement the way the
clock handler is used afterwards has a pure optional clock semantic
(though in some circumstances we can get a warning about the clock missing
printed in the system console). There is no point in reimplementing that
functionality seeing the kernel clock framework already supports the
optional interface from scratch. Thus let's convert the platform driver to
using it.
Note by providing this commit we get to fix two problems. The first one
was introduced in commit c62ebb3d5f0d ("i2c: designware: Add support for
an interface clock"). It causes not having the interface clock (pclk)
enabled/disabled in case if the reference clock isn't provided. The second
problem was first introduced in commit b33af11de236 ("i2c: designware: Do
not require clock when SSCN and FFCN are provided"). Since that
modification the deferred probe procedure has been unsupported in case if
the interface clock isn't ready.
Fixes: c62ebb3d5f0d ("i2c: designware: Add support for an interface clock") Fixes: b33af11de236 ("i2c: designware: Do not require clock when SSCN and FFCN are provided") Signed-off-by: Serge Semin <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Acked-by: Jarkko Nikula <[email protected]> Signed-off-by: Wolfram Sang <[email protected]>
Jens Axboe [Mon, 13 Jun 2022 12:52:52 +0000 (06:52 -0600)]
Merge branch 'io_uring/io_uring-5.19' of https://github.com/isilence/linux into io_uring-5.19
Pull io_uring fixes from Pavel.
* 'io_uring/io_uring-5.19' of https://github.com/isilence/linux:
io_uring: fix double unlock for pbuf select
io_uring: kbuf: fix bug of not consuming ring buffer in partial io case
io_uring: openclose: fix bug of closing wrong fixed file
io_uring: fix not locked access to fixed buf table
io_uring: fix races with buffer table unregister
io_uring: fix races with file table unregister
We ran into multiple DMA TX errors while writing files over a network
block device running on top of a DMA-connected AXI Ethernet device on
64-bit RISC-V machines. The errors indicated that the DMA had fetched a
null descriptor and we found that the reason for this is that AXI DMA had
unexpectedly processed a partially updated tail descriptor pointer. To
fix it, we suggest that the driver should use one 64-bit write instead
of two 32-bit writes to perform such update if possible. For those
archectures where double-word load/stores are unavailable, e.g. 32-bit
archectures, force a driver probe failure if the driver finds 64-bit
capability on DMA.
====================
Andy Chiu [Mon, 13 Jun 2022 03:42:02 +0000 (11:42 +0800)]
net: axienet: Use iowrite64 to write all 64b descriptor pointers
According to commit f735c40ed93c ("net: axienet: Autodetect 64-bit DMA
capability") and AXI-DMA spec (pg021), on 64-bit capable dma, only
writing MSB part of tail descriptor pointer causes DMA engine to start
fetching descriptors. However, we found that it is true only if dma is in
idle state. In other words, dma would use a tailp even if it only has LSB
updated, when the dma is running.
The non-atomicity of this behavior could be problematic if enough
delay were introduced in between the 2 writes. For example, if an
interrupt comes right after the LSB write and the cpu spends long
enough time in the handler for the dma to get back into idle state by
completing descriptors, then the seconcd write to MSB would treat dma
to start fetching descriptors again. Since the descriptor next to the
one pointed by current tail pointer is not filled by the kernel yet,
fetching a null descriptor here causes a dma internal error and halt
the dma engine down.
We suggest that the dma engine should start process a 64-bit MMIO write
to the descriptor pointer only if ONE 32-bit part of it is written on all
states. Or we should restrict the use of 64-bit addressable dma on 32-bit
platforms, since those devices have no instruction to guarantee the write
to LSB and MSB part of tail pointer occurs atomically to the dma.
Andy Chiu [Mon, 13 Jun 2022 03:42:01 +0000 (11:42 +0800)]
net: axienet: make the 64b addresable DMA depends on 64b archectures
Currently it is not safe to config the IP as 64-bit addressable on 32-bit
archectures, which cannot perform a double-word store on its descriptor
pointers. The pointer is 64-bit wide if the IP is configured as 64-bit,
and the device would process the partially updated pointer on some
states if the pointer was updated via two store-words. To prevent such
condition, we force a probe fail if we discover that the IP has 64-bit
capability but it is not running on a 64-Bit kernel.
This is a series of patch (1/2). The next patch must be applied in order
to make 64b DMA safe on 64b archectures.
Dylan Yudaken [Mon, 13 Jun 2022 10:11:57 +0000 (03:11 -0700)]
io_uring: limit size of provided buffer ring
The type of head and tail do not allow more than 2^15 entries in a
provided buffer ring, so do not allow this.
At 2^16 while each entry can be indexed, there is no way to
disambiguate full vs empty.
Guangbin Huang [Sat, 11 Jun 2022 12:25:29 +0000 (20:25 +0800)]
net: hns3: fix tm port shapping of fibre port is incorrect after driver initialization
Currently in driver initialization process, driver will set shapping
parameters of tm port to default speed read from firmware. However, the
speed of SFP module may not be default speed, so shapping parameters of
tm port may be incorrect.
To fix this problem, driver sets new shapping parameters for tm port
after getting exact speed of SFP module in this case.
Fixes: 88d10bd6f730 ("net: hns3: add support for multiple media type") Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jie Wang [Sat, 11 Jun 2022 12:25:28 +0000 (20:25 +0800)]
net: hns3: fix PF rss size initialization bug
Currently hns3 driver misuses the VF rss size to initialize the PF rss size
in hclge_tm_vport_tc_info_update. So this patch fix it by checking the
vport id before initialization.
Fixes: 7347255ea389 ("net: hns3: refactor PF rss get APIs with new common rss get APIs") Signed-off-by: Jie Wang <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Guangbin Huang [Sat, 11 Jun 2022 12:25:27 +0000 (20:25 +0800)]
net: hns3: restore tm priority/qset to default settings when tc disabled
Currently, settings parameters of schedule mode, dwrr, shaper of tm
priority or qset of one tc are only be set when tc is enabled, they are
not restored to the default settings when tc is disabled. It confuses
users when they cat tm_priority or tm_qset files of debugfs. So this
patch fixes it.
Fixes: 848440544b41 ("net: hns3: Add support of TX Scheduler & Shaper to HNS3 driver") Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jie Wang [Sat, 11 Jun 2022 12:25:26 +0000 (20:25 +0800)]
net: hns3: modify the ring param print info
Currently tx push is also a ring param. So the original ring param print
info in hns3_is_ringparam_changed should be adjusted.
Fixes: 07fdc163ac88 ("net: hns3: refactor hns3_set_ringparam()") Signed-off-by: Jie Wang <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Jian Shen [Sat, 11 Jun 2022 12:25:25 +0000 (20:25 +0800)]
net: hns3: don't push link state to VF if unalive
It's unnecessary to push link state to unalive VF, and the VF will
query link state from PF when it being start works.
Fixes: 18b6e31f8bf4 ("net: hns3: PF add support for pushing link status to VFs") Signed-off-by: Jian Shen <[email protected]> Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Guangbin Huang [Sat, 11 Jun 2022 12:25:24 +0000 (20:25 +0800)]
net: hns3: set port base vlan tbl_sta to false before removing old vlan
When modify port base vlan, the port base vlan tbl_sta needs to set to
false before removing old vlan, to indicate this operation is not finish.
Fixes: c0f46de30c96 ("net: hns3: fix port base vlan add fail when concurrent with reset") Signed-off-by: Guangbin Huang <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Pavel Begunkov [Sun, 12 Jun 2022 13:31:38 +0000 (14:31 +0100)]
io_uring: fix double unlock for pbuf select
io_buffer_select(), which is the only caller of io_ring_buffer_select(),
fully handles locking, mutex unlock in io_ring_buffer_select() will lead
to double unlock.
Fixes: c7fb19428d67d ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Pavel Begunkov <[email protected]>
Hao Xu [Sat, 11 Jun 2022 12:29:52 +0000 (20:29 +0800)]
io_uring: kbuf: fix bug of not consuming ring buffer in partial io case
When we use ring-mapped provided buffer, we should consume it before
arm poll if partial io has been done. Otherwise the buffer may be used
by other requests and thus we lost the data.
Fixes: c7fb19428d67 ("io_uring: add support for ring mapped supplied buffers") Signed-off-by: Hao Xu <[email protected]>
[pavel: 5.19 rebase] Signed-off-by: Pavel Begunkov <[email protected]>