Merge tag 'drm-fixes-2021-07-16' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Regular rc2 fixes though a bit more than usual at rc2 stage, people
must have been testing early or else some fixes from last week got a
bit laggy.
There is one larger change in the amd fixes to amalgamate some power
management code on the newer chips with the code from the older chips,
it should only affects chips where support was introduced in rc1 and
it should make future fixes easier to maintain probably a good idea to
merge it now.
Otherwise it's mostly fixes across the board.
dma-buf:
- Fix fence leak in sync_file_merge() error code
drm/panel:
- nt35510: Don't fail on DSI reads
fbdev:
- Avoid use-after-free by not deleting current video mode
ttm:
- Avoid NULL-ptr deref in ttm_range_man_fini()
vmwgfx:
- Fix a merge commit
qxl:
- fix a TTM regression
amdgpu:
- SR-IOV fixes
- RAS fixes
- eDP fixes
- SMU13 code unification to facilitate fixes in the future
- Add new renoir DID
- Yellow Carp fixes
- Beige Goby fixes
- Revert a bunch of TLB fixes that caused regressions
- Revert an LTTPR display regression
amdkfd
- Fix VRAM access regression
- SVM fixes
i915:
- Fix -EDEADLK handling regression
- Drop the page table optimisation"
* tag 'drm-fixes-2021-07-16' of git://anongit.freedesktop.org/drm/drm: (29 commits)
drm/amdgpu: add another Renoir DID
drm/ttm: add a check against null pointer dereference
drm/i915/gtt: drop the page table optimisation
drm/i915/gt: Fix -EDEADLK handling regression
drm/amd/pm: Add waiting for response of mode-reset message for yellow carp
Revert "drm/amdkfd: Add heavy-weight TLB flush after unmapping"
Revert "drm/amdgpu: Add table_freed parameter to amdgpu_vm_bo_update"
Revert "drm/amdkfd: Make TLB flush conditional on mapping"
Revert "drm/amdgpu: Fix warning of Function parameter or member not described"
Revert "drm/amdkfd: Add memory sync before TLB flush on unmap"
drm/amd/pm: Fix BACO state setting for Beige_Goby
drm/amdgpu: Restore msix after FLR
drm/amdkfd: Allow CPU access for all VRAM BOs
drm/amdgpu/display - only update eDP's backlight level when necessary
drm/amdkfd: handle fault counters on invalid address
drm/amdgpu: Correct the irq numbers for virtual crtc
drm/amd/display: update header file name
drm/amd/pm: drop smu_v13_0_1.c|h files for yellow carp
drm/amd/display: remove faulty assert
Revert "drm/amd/display: Always write repeater mode regardless of LTTPR"
...
Merge branch 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fixes from Paul McKenney:
- fix regressions induced by a merge-window change in scheduler
semantics, which means that smp_processor_id() can no longer be used
in kthreads using simple affinity to bind themselves to a specific
CPU.
- fix a bug in Tasks Trace RCU that was thought to be strictly
theoretical. However, production workloads have started hitting this,
so these fixes need to be merged sooner rather than later.
- fix a minor printk()-format-mismatch issue introduced during the
merge window.
* 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Fix pr_info() formats and values in show_rcu_gp_kthreads()
rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
rcu-tasks: Don't delete holdouts within trc_inspect_reader()
refscale: Avoid false-positive warnings in ref_scale_reader()
scftorture: Avoid false-positive warnings in scftorture_invoker()
dt-bindings: display: renesas,du: Make resets optional on R-Car H1
The "resets" property is not present on R-Car Gen1 SoCs.
Supporting it would require migrating from renesas,cpg-clocks to
renesas,cpg-mssr.
Reflect this in the DT bindings by removing the global "required:
resets". All SoCs that do have "resets" properties already have
SoC-specific rules making it required.
cadence-quadspi has a builtin Auto-HW polling funtionality using which
it keep tracks of completion of write operations. When Auto-HW polling
is enabled, it automatically initiates status register read operation,
until the flash clears its busy bit.
cadence-quadspi controller doesn't allow an address phase when
auto-polling the busy bit on the status register. Unlike SPI NOR
flashes, SPI NAND flashes do require the address of status register
when polling the busy bit using the read register operation. As
Auto-HW polling is enabled by default, cadence-quadspi returns a
timeout for every write operation after an indefinite amount of
polling on SPI NAND flashes.
Disable Auto-HW polling completely as the spi-nor core, spinand core,
etc. take care of polling the busy bit on their own.
spi: spi-cadence-quadspi: Fix division by zero warning
Fix below division by zero warning:
- The reason for dividing by zero is because the dummy bus width is zero,
but if the dummy n bytes is zero, it indicates that there is no data transfer,
so we can just return zero without doing any calculations.
[ 0.795337] Division by zero in kernel.
:
[ 0.834051] [<807fd40c>] (__div0) from [<804e1acc>] (Ldiv0+0x8/0x10)
[ 0.839097] [<805f0710>] (cqspi_exec_mem_op) from [<805edb4c>] (spi_mem_exec_op+0x3b0/0x3f8)
Paulo Alcantara [Fri, 16 Jul 2021 00:53:53 +0000 (21:53 -0300)]
cifs: do not share tcp sessions of dfs connections
Make sure that we do not share tcp sessions of dfs mounts when
mounting regular shares that connect to same server. DFS connections
rely on a single instance of tcp in order to do failover properly in
cifs_reconnect().
It turns out that the problem with the clang -Wimplicit-fallthrough
warning is not about the kernel source code, but about clang itself, and
that the warning is unusable until clang fixes its broken ways.
In particular, when you enable this warning for clang, you not only get
warnings about implicit fallthroughs. You also get this:
warning: fallthrough annotation in unreachable code [-Wimplicit-fallthrough]
which is completely broken becasue it
(a) doesn't even tell you where the problem is (seriously: no line
numbers, no filename, no nothing).
(b) is fundamentally broken anyway, because there are perfectly valid
reasons to have a fallthrough statement even if it turns out that
it can perhaps not be reached.
In the kernel, an example of that second case is code in the scheduler:
switch (state) {
case cpuset:
if (IS_ENABLED(CONFIG_CPUSETS)) {
cpuset_cpus_allowed_fallback(p);
state = possible;
break;
}
fallthrough;
case possible:
where if CONFIG_CPUSETS is enabled you actually never hit the
fallthrough case at all. But that in no way makes the fallthrough
wrong.
So the warning is completely broken, and enabling it for clang is a very
bad idea.
In the meantime, we can keep the gcc option enabled, and make the gcc
build use
-Wimplicit-fallthrough=5
which means that we will at least continue to require a proper
fallthrough statement, and that gcc won't silently accept the magic
comment versions. Because gcc does this all correctly, and while the odd
"=5" part is kind of obscure, it's documented in [1]:
"-Wimplicit-fallthrough=5 doesn’t recognize any comments as
fallthrough comments, only attributes disable the warning"
so if clang ever fixes its bad behavior we can try enabling it there again.
Merge tag 'pwm/for-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
Pull pwm fixes from Thierry Reding:
"A couple of fixes from Uwe that I missed for v5.14-rc1"
* tag 'pwm/for-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
pwm: ep93xx: Ensure configuring period and duty_cycle isn't wrongly skipped
pwm: berlin: Ensure configuring period and duty_cycle isn't wrongly skipped
pwm: tiecap: Ensure configuring period and duty_cycle isn't wrongly skipped
pwm: spear: Ensure configuring period and duty_cycle isn't wrongly skipped
pwm: sprd: Ensure configuring period and duty_cycle isn't wrongly skipped
Steve French [Thu, 15 Jul 2021 04:32:09 +0000 (23:32 -0500)]
SMB3.1.1: fix mount failure to some servers when compression enabled
When sending the compression context to some servers, they rejected
the SMB3.1.1 negotiate protocol because they expect the compression
context to have a data length of a multiple of 8.
We have a few ref counters srv_count, ses_count and
tc_count which we use for ref counting. Added a WARN_ON
during the decrement of each of these counters to make
sure that they don't go below their minimum values.
Steve French [Wed, 14 Jul 2021 00:40:33 +0000 (19:40 -0500)]
cifs: fix missing null session check in mount
Although it is unlikely to be have ended up with a null
session pointer calling cifs_try_adding_channels in cifs_mount.
Coverity correctly notes that we are already checking for
it earlier (when we return from do_dfs_failover), so at
a minimum to clarify the code we should make sure we also
check for it when we exit the loop so we don't end up calling
cifs_try_adding_channels or mount_setup_tlink with a null
ses pointer.
Merge tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull fallthrough fixes from Gustavo Silva:
"This fixes many fall-through warnings when building with Clang and
-Wimplicit-fallthrough, and also enables -Wimplicit-fallthrough for
Clang, globally.
It's also important to notice that since we have adopted the use of
the pseudo-keyword macro fallthrough, we also want to avoid having
more /* fall through */ comments being introduced. Contrary to GCC,
Clang doesn't recognize any comments as implicit fall-through markings
when the -Wimplicit-fallthrough option is enabled.
So, in order to avoid having more comments being introduced, we use
the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang,
will cause a warning in case a code comment is intended to be used as
a fall-through marking. The patch for Makefile also enforces this.
We had almost 4,000 of these issues for Clang in the beginning, and
there might be a couple more out there when building some
architectures with certain configurations. However, with the recent
fixes I think we are in good shape and it is now possible to enable
the warning for Clang"
* tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
Makefile: Enable -Wimplicit-fallthrough for Clang
powerpc/smp: Fix fall-through warning for Clang
dmaengine: mpc512x: Fix fall-through warning for Clang
usb: gadget: fsl_qe_udc: Fix fall-through warning for Clang
powerpc/powernv: Fix fall-through warning for Clang
MIPS: Fix unreachable code issue
MIPS: Fix fall-through warnings for Clang
ASoC: Mediatek: MT8183: Fix fall-through warning for Clang
power: supply: Fix fall-through warnings for Clang
dmaengine: ti: k3-udma: Fix fall-through warning for Clang
s390: Fix fall-through warnings for Clang
dmaengine: ipu: Fix fall-through warning for Clang
iommu/arm-smmu-v3: Fix fall-through warning for Clang
mmc: jz4740: Fix fall-through warning for Clang
PCI: Fix fall-through warning for Clang
scsi: libsas: Fix fall-through warning for Clang
video: fbdev: Fix fall-through warning for Clang
math-emu: Fix fall-through warning
cpufreq: Fix fall-through warning for Clang
drm/msm: Fix fall-through warning in msm_gem_new_impl()
...
# perf test "88: Check open filename arg using perf trace + vfs_getname"
The third of these leaks is related to evsel->priv fields of sycalls
never being deallocated.
This patch adds the function evlist__free_syscall_tp_fields which
iterates over all evsels in evlist, matching syscalls, and calling the
missing frees.
This new function is called at the end of trace__run, right before
calling evlist__delete.
# perf test "88: Check open filename arg using perf trace + vfs_getname"
The first of these leaks is related to struct trace fields never being
deallocated.
This patch adds the function trace__exit, which is called at the end of
cmd_trace, replacing the existing deallocation, which is now moved
inside the new function.
perf report: Free generated help strings for sort option
ASan reports the memory leak of the strings allocated by sort_help() when
running perf report.
This patch changes the returned pointer to char* (instead of const
char*), saves it in a temporary variable, and finally deallocates it at
function exit.
Dongliang Mu [Wed, 14 Jul 2021 09:13:22 +0000 (17:13 +0800)]
usb: hso: fix error handling code of hso_create_net_device
The current error handling code of hso_create_net_device is
hso_free_net_device, no matter which errors lead to. For example,
WARNING in hso_free_net_device [1].
Fix this by refactoring the error handling code of
hso_create_net_device by handling different errors by different code.
Subsystems affected by this patch series: mm (kasan, pagealloc, rmap,
hmm, and hugetlb), and hfs"
* emailed patches from Andrew Morton <[email protected]>:
mm/hugetlb: fix refs calculation from unaligned @vaddr
hfs: add lock nesting notation to hfs_find_init
hfs: fix high memory mapping in hfs_bnode_read
hfs: add missing clean-up in hfs_fill_super
lib/test_hmm: remove set but unused page variable
mm: fix the try_to_unmap prototype for !CONFIG_MMU
mm/page_alloc: further fix __alloc_pages_bulk() return value
mm/page_alloc: correct return value when failing at preparing
mm/page_alloc: avoid page allocator recursion with pagesets.lock held
Revert "mm/page_alloc: make should_fail_alloc_page() static"
kasan: fix build by including kernel.h
kasan: add memzero init for unaligned size at DEBUG
mm: move helper to check slub_debug_enabled
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
- Allow again loading KVM on 32-bit non-PAE builds
- Fixes for host SMIs on AMD
- Fixes for guest SMIs on AMD
- Fixes for selftests on s390 and ARM
- Fix memory leak
- Enforce no-instrumentation area on vmentry when hardware breakpoints
are in use.
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
KVM: selftests: smm_test: Test SMM enter from L2
KVM: nSVM: Restore nested control upon leaving SMM
KVM: nSVM: Fix L1 state corruption upon return from SMM
KVM: nSVM: Introduce svm_copy_vmrun_state()
KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN
KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities
KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails
KVM: SVM: add module param to control the #SMI interception
KVM: SVM: remove INIT intercept handler
KVM: SVM: #SMI interception must not skip the instruction
KVM: VMX: Remove vmx_msr_index from vmx.h
KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc
kvm: debugfs: fix memory leak in kvm_create_vm_debugfs
KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler
KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
...
spi: spi-cadence-quadspi: Fix division by zero warning
Fix below division by zero warning:
- Added an if statement because buswidth can be zero, resulting in division by zero.
- The modified code was based on another driver (atmel-quadspi).
[ 0.795337] Division by zero in kernel.
:
[ 0.834051] [<807fd40c>] (__div0) from [<804e1acc>] (Ldiv0+0x8/0x10)
[ 0.839097] [<805f0710>] (cqspi_exec_mem_op) from [<805edb4c>] (spi_mem_exec_op+0x3b0/0x3f8)
Merge tag 'iommu-fixes-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Revert a patch which caused boot failures with QCOM IOMMU
- Two fixes for Intel VT-d context table handling
- Physical address decoding fix for Rockchip IOMMU
- Add a reviewer for AMD IOMMU
* tag 'iommu-fixes-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
MAINTAINERS: Add Suravee Suthikulpanit as Reviewer for AMD IOMMU (AMD-Vi)
iommu/rockchip: Fix physical address decoding
iommu/vt-d: Fix clearing real DMA device's scalable-mode context entries
iommu/vt-d: Global devTLB flush when present context entry changed
iommu/qcom: Revert "iommu/arm: Cleanup resources in case of probe error path"
Ziyang Xuan [Thu, 15 Jul 2021 12:22:04 +0000 (20:22 +0800)]
net: fix uninit-value in caif_seqpkt_sendmsg
When nr_segs equal to zero in iovec_from_user, the object
msg->msg_iter.iov is uninit stack memory in caif_seqpkt_sendmsg
which is defined in ___sys_sendmsg. So we cann't just judge
msg->msg_iter.iov->base directlly. We can use nr_segs to judge
msg in caif_seqpkt_sendmsg whether has data buffers.
Jakub Sitnicki [Wed, 14 Jul 2021 15:47:50 +0000 (17:47 +0200)]
bpf, sockmap, udp: sk_prot needs inuse_idx set for proc stats
The proc socket stats use sk_prot->inuse_idx value to record inuse sock
stats. We currently do not set this correctly from sockmap side. The
result is reading sock stats '/proc/net/sockstat' gives incorrect values.
The socket counter is incremented correctly, but because we don't set the
counter correctly when we replace sk_prot we may omit the decrement.
To get the correct inuse_idx value move the core_initcall that initializes
the UDP proto handlers to late_initcall. This way it is initialized after
UDP has the chance to assign the inuse_idx value from the register protocol
handler.
John Fastabend [Mon, 12 Jul 2021 19:55:46 +0000 (12:55 -0700)]
bpf, sockmap, tcp: sk_prot needs inuse_idx set for proc stats
The proc socket stats use sk_prot->inuse_idx value to record inuse sock
stats. We currently do not set this correctly from sockmap side. The
result is reading sock stats '/proc/net/sockstat' gives incorrect values.
The socket counter is incremented correctly, but because we don't set the
counter correctly when we replace sk_prot we may omit the decrement.
To get the correct inuse_idx value move the core_initcall that initializes
the TCP proto handlers to late_initcall. This way it is initialized after
TCP has the chance to assign the inuse_idx value from the register protocol
handler.
John Fastabend [Mon, 12 Jul 2021 19:55:45 +0000 (12:55 -0700)]
bpf, sockmap: Fix potential memory leak on unlikely error case
If skb_linearize is needed and fails we could leak a msg on the error
handling. To fix ensure we kfree the msg block before returning error.
Found during code review.
Colin Ian King [Thu, 15 Jul 2021 12:57:12 +0000 (13:57 +0100)]
s390/bpf: Perform r1 range checking before accessing jit->seen_reg[r1]
Currently array jit->seen_reg[r1] is being accessed before the range
checking of index r1. The range changing on r1 should be performed
first since it will avoid any potential out-of-range accesses on the
array seen_reg[] and also it is more optimal to perform checks on r1
before fetching data from the array. Fix this by swapping the order
of the checks before the array access.
Tracepoint trace_qdisc_enqueue() is introduced to trace skb at
the entrance of TC layer on TX side. This is similar to
trace_qdisc_dequeue():
1. For both we only trace successful cases. The failure cases
can be traced via trace_kfree_skb().
2. They are called at entrance or exit of TC layer, not for each
->enqueue() or ->dequeue(). This is intentional, because
we want to make trace_qdisc_enqueue() symmetric to
trace_qdisc_dequeue(), which is easier to use.
The return value of qdisc_enqueue() is not interesting here,
we have Qdisc's drop packets in ->dequeue(), it is impossible to
trace them even if we have the return value, the only way to trace
them is tracing kfree_skb().
We only add information we need to trace ring buffer. If any other
information is needed, it is easy to extend it without breaking ABI,
see commit 3dd344ea84e1 ("net: tracepoint: exposing sk_family in all
tcp:tracepoints").
net: use %px to print skb address in trace_netif_receive_skb
The print format of skb adress in tracepoint class net_dev_template
is changed to %px from %p, because we want to use skb address
as a quick way to identify a packet.
Note, trace ring buffer is only accessible to privileged users,
it is safe to use a real kernel address here.
Colin Ian King [Wed, 14 Jul 2021 15:23:43 +0000 (16:23 +0100)]
liquidio: Fix unintentional sign extension issue on left shift of u16
Shifting the u16 integer oct->pcie_port by CN23XX_PKT_INPUT_CTL_MAC_NUM_POS
(29) bits will be promoted to a 32 bit signed int and then sign-extended
to a u64. In the cases where oct->pcie_port where bit 2 is set (e.g. 3..7)
the shifted value will be sign extended and the top 32 bits of the result
will be set.
Fix this by casting the u16 values to a u64 before the 29 bit left shift.
Addresses-Coverity: ("Unintended sign extension")
Fixes: 3451b97cce2d ("liquidio: CN23XX register setup") Signed-off-by: Colin Ian King <[email protected]> Signed-off-by: David S. Miller <[email protected]>
mm/hugetlb: fix refs calculation from unaligned @vaddr
Commit 82e5d378b0e47 ("mm/hugetlb: refactor subpage recording")
refactored the count of subpages but missed an edge case when @vaddr is
not aligned to PAGE_SIZE e.g. when close to vma->vm_end. It would then
errousnly set @refs to 0 and record_subpages_vmas() wouldn't set the
@pages array element to its value, consequently causing the reported
null-deref by syzbot.
Fix it by aligning down @vaddr by PAGE_SIZE in @refs calculation.
This happens due to missing lock nesting information. From the logs, we
see that a call to hfs_fill_super is made to mount the hfs filesystem.
While searching for the root inode, the lock on the catalog btree is
grabbed. Then, when the parent of the root isn't found, a call to
__hfs_bnode_create is made to create the parent of the root. This
eventually leads to a call to hfs_ext_read_extent which grabs a lock on
the extents btree.
Since the order of locking is catalog btree -> extents btree, this lock
hierarchy does not lead to a deadlock.
To tell lockdep that this locking is safe, we add nesting notation to
distinguish between catalog btrees, extents btrees, and attributes
btrees (for HFS+). This has already been done in hfsplus.
Pages that we read in hfs_bnode_read need to be kmapped into kernel
address space. However, currently only the 0th page is kmapped. If the
given offset + length exceeds this 0th page, then we have an invalid
memory access.
To fix this, we kmap relevant pages one by one and copy their relevant
portions of data.
An example of invalid memory access occurring without this fix can be seen
in the following crash report:
==================================================================
BUG: KASAN: use-after-free in memcpy include/linux/fortify-string.h:191 [inline]
BUG: KASAN: use-after-free in hfs_bnode_read+0xc4/0xe0 fs/hfs/bnode.c:26
Read of size 2 at addr ffff888125fdcffe by task syz-executor5/4634
This series ultimately aims to address a lockdep warning in
hfs_find_init reported by Syzbot [1].
The work done for this led to the discovery of another bug, and the
Syzkaller repro test also reveals an invalid memory access error after
clearing the lockdep warning. Hence, this series is broken up into
three patches:
1. Add a missing call to hfs_find_exit for an error path in
hfs_fill_super
2. Fix memory mapping in hfs_bnode_read by fixing calls to kmap
3. Add lock nesting notation to tell lockdep that the observed locking
hierarchy is safe
This patch (of 3):
Before exiting hfs_fill_super, the struct hfs_find_data used in
hfs_find_init should be passed to hfs_find_exit to be cleaned up, and to
release the lock held on the btree.
The call to hfs_find_exit is missing from an error path. We add it back
in by consolidating calls to hfs_find_exit for error paths.
The HMM selftests use atomic_check_access() to check atomic access to a
page has been revoked. It doesn't matter if the page mapping has been
removed from the mirrored page tables as that also implies atomic access
has been revoked. Therefore remove the unused page variable to fix this
compiler warning:
lib/test_hmm.c:631:16: warning: variable `page' set but not used [-Wunused-but-set-variable]
mm: fix the try_to_unmap prototype for !CONFIG_MMU
Adjust the nommu stub of try_to_unmap to match the changed protype for the
full version. Turn it into an inline instead of a macro to generally
improve the type checking.
Chuck Lever [Thu, 15 Jul 2021 04:26:52 +0000 (21:26 -0700)]
mm/page_alloc: further fix __alloc_pages_bulk() return value
The author of commit b3b64ebd3822 ("mm/page_alloc: do bulk array
bounds check after checking populated elements") was possibly
confused by the mixture of return values throughout the function.
The API contract is clear that the function "Returns the number of pages
on the list or array." It does not list zero as a unique return value with
a special meaning. Therefore zero is a plausible return value only if
@nr_pages is zero or less.
Clean up the return logic to make it clear that the returned value is
always the total number of pages in the array/list, not the number of
pages that were allocated during this call.
The only change in behavior with this patch is the value returned if
prepare_alloc_pages() fails. To match the API contract, the number of
pages currently in the array/list is returned in this case.
The call site in __page_pool_alloc_pages_slow() also seems to be confused
on this matter. It should be attended to by someone who is familiar with
that code.
There are a number of ways it could be fixed. The page owner code could
be audited to strip GFP flags that allow sleeping but it'll impair the
functionality of PAGE_OWNER if allocations fail. The bulk allocator could
add a special case to release/reacquire the lock for prep_new_page and
lookup PCP after the lock is reacquired at the cost of performance. The
pages requiring prep could be tracked using the least significant bit and
looping through the array although it is more complicated for the list
interface. The options are relatively complex and the second one still
incurs a performance penalty when PAGE_OWNER is active so this patch takes
the simple approach -- disable bulk allocation of PAGE_OWNER is active.
The caller will be forced to allocate one page at a time incurring a
performance penalty but PAGE_OWNER is already a performance penalty.
Marco Elver [Thu, 15 Jul 2021 04:26:40 +0000 (21:26 -0700)]
kasan: fix build by including kernel.h
The <linux/kasan.h> header relies on _RET_IP_ being defined, and had been
receiving that definition via inclusion of bug.h which includes kernel.h.
However, since f39650de687e ("kernel.h: split out panic and oops helpers")
that is no longer the case and get the following build error when building
CONFIG_KASAN_HW_TAGS on arm64:
In file included from arch/arm64/mm/kasan_init.c:10:
include/linux/kasan.h: In function 'kasan_slab_free':
include/linux/kasan.h:230:39: error: '_RET_IP_' undeclared (first use in this function)
230 | return __kasan_slab_free(s, object, _RET_IP_, init);
net: dsa: mv88e6xxx: NET_DSA_MV88E6XXX_PTP should depend on NET_DSA_MV88E6XXX
Making global2 support mandatory removed the Kconfig symbol
NET_DSA_MV88E6XXX_GLOBAL2. This symbol also served as an intermediate
symbol to make NET_DSA_MV88E6XXX_PTP depend on NET_DSA_MV88E6XXX. With
the symbol removed, the user is always asked about PTP support for
Marvell 88E6xxx switches, even if the latter support is not enabled.
Fix this by reinstating the dependency.
Fixes: 63368a7416df144b ("net: dsa: mv88e6xxx: Make global2 support mandatory") Signed-off-by: Geert Uytterhoeven <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Reviewed-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
If we encounter a directory that has been configured to pass on an
extent size hint to a new realtime file and the hint isn't an integer
multiple of the rt extent size, we should flag the hint for
administrative review because that is a misconfiguration (that other
parts of the kernel will fix automatically).
Darrick J. Wong [Mon, 12 Jul 2021 19:58:49 +0000 (12:58 -0700)]
xfs: fix an integer overflow error in xfs_growfs_rt
During a realtime grow operation, we run a single transaction for each
rt bitmap block added to the filesystem. This means that each step has
to be careful to increase sb_rblocks appropriately.
Fix the integer overflow error in this calculation that can happen when
the extent size is very large. Found by running growfs to add a rt
volume to a filesystem formatted with a 1g rt extent size.
Darrick J. Wong [Mon, 12 Jul 2021 19:58:48 +0000 (12:58 -0700)]
xfs: improve FSGROWFSRT precondition checking
Improve the checking at the start of a realtime grow operation so that
we avoid accidentally set a new extent size that is too large and avoid
adding an rt volume to a filesystem with rmap or reflink because we
don't support rt rmap or reflink yet.
While we're at it, separate the checks so that we're only testing one
aspect at a time.
Darrick J. Wong [Mon, 12 Jul 2021 19:58:51 +0000 (12:58 -0700)]
xfs: don't expose misaligned extszinherit hints to userspace
Commit 603f000b15f2 changed xfs_ioctl_setattr_check_extsize to reject an
attempt to set an EXTSZINHERIT extent size hint on a directory with
RTINHERIT set if the hint isn't a multiple of the realtime extent size.
However, I have recently discovered that it is possible to change the
realtime extent size when adding a rt device to a filesystem, which
means that the existence of directories with misaligned inherited hints
is not an accident.
As a result, it's possible that someone could have set a valid hint and
added an rt volume with a different rt extent size, which invalidates
the ondisk hints. After such a sequence, FSGETXATTR will report a
misaligned hint, which FSSETXATTR will trip over, causing confusion if
the user was doing the usual GET/SET sequence to change some other
attribute. Change xfs_fill_fsxattr to omit the hint if it isn't aligned
properly.
Fixes: 603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
Darrick J. Wong [Mon, 12 Jul 2021 19:58:50 +0000 (12:58 -0700)]
xfs: correct the narrative around misaligned rtinherit/extszinherit dirs
While auditing the realtime growfs code, I realized that the GROWFSRT
ioctl (and by extension xfs_growfs) has always allowed sysadmins to
change the realtime extent size when adding a realtime section to the
filesystem. Since we also have always allowed sysadmins to set
RTINHERIT and EXTSZINHERIT on directories even if there is no realtime
device, this invalidates the premise laid out in the comments added in
commit 603f000b15f2.
In other words, this is not a case of inadequate metadata validation.
This is a case of nearly forgotten (and apparently untested) but
supported functionality. Update the comments to reflect what we've
learned, and remove the log message about correcting the misalignment.
Fixes: 603f000b15f2 ("xfs: validate extsz hints against rt extent size when rtinherit is set") Signed-off-by: Darrick J. Wong <[email protected]> Reviewed-by: Carlos Maiolino <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]>
Darrick J. Wong [Mon, 12 Jul 2021 19:58:48 +0000 (12:58 -0700)]
xfs: reset child dir '..' entry when unlinking child
While running xfs/168, I noticed a second source of post-shrink
corruption errors causing shutdowns.
Let's say that directory B has a low inode number and is a child of
directory A, which has a high number. If B is empty but open, and
unlinked from A, B's dotdot link continues to point to A. If A is then
unlinked and the filesystem shrunk so that A is no longer a valid inode,
a subsequent AIL push of B will trip the inode verifiers because the
dotdot entry points outside of the filesystem.
To avoid this problem, reset B's dotdot entry to the root directory when
unlinking directories, since the root directory cannot be removed.
Darrick J. Wong [Mon, 12 Jul 2021 19:58:47 +0000 (12:58 -0700)]
xfs: check for sparse inode clusters that cross new EOAG when shrinking
While running xfs/168, I noticed occasional write verifier shutdowns
involving inodes at the very end of the filesystem. Existing inode
btree validation code checks that all inode clusters are fully contained
within the filesystem.
However, due to inadequate checking in the fs shrink code, it's possible
that there could be a sparse inode cluster at the end of the filesystem
where the upper inodes of the cluster are marked as holes and the
corresponding blocks are free. In this case, the last blocks in the AG
are listed in the bnobt. This enables the shrink to proceed but results
in a filesystem that trips the inode verifiers. Fix this by disallowing
the shrink.
iomap: Don't create iomap_page objects for inline files
In iomap_readpage_actor, don't create iop objects for inline inodes.
Otherwise, iomap_read_inline_data will set PageUptodate without setting
iop->uptodate, and iomap_page_release will eventually complain.
To prevent this kind of bug from occurring in the future, make sure the
page doesn't have private data attached in iomap_read_inline_data.
iomap: Permit pages without an iop to enter writeback
Create an iop in the writeback path if one doesn't exist. This allows us
to avoid creating the iop in some cases. We'll initially do that for pages
with inline data, but it can be extended to pages which are entirely within
an extent. It also allows for an iop to be removed from pages in the
future (eg page split).
iomap: remove the length variable in iomap_seek_hole
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
iomap: remove the length variable in iomap_seek_data
The length variable is rather pointless given that it can be trivially
deduced from offset and size. Also the initial calculation can lead
to KASAN warnings.
Mark Rutland [Thu, 15 Jul 2021 12:30:49 +0000 (13:30 +0100)]
arm64: entry: fix KCOV suppression
We suppress KCOV for entry.o rather than entry-common.o. As entry.o is
built from entry.S, this is pointless, and permits instrumentation of
entry-common.o, which is built from entry-common.c.
Fix the Makefile to suppress KCOV for entry-common.o, as we had intended
to begin with. I've verified with objdump that this is working as
expected.
Mark Rutland [Wed, 14 Jul 2021 17:28:01 +0000 (18:28 +0100)]
arm64: entry: add missing noinstr
We intend that all the early exception handling code is marked as
`noinstr`, but we forgot this for __el0_error_handler_common(), which is
called before we have completed entry from user mode. If it were
instrumented, we could run into problems with RCU, lockdep, etc.
Mark it as `noinstr` to prevent this.
The few other functions in entry-common.c which do not have `noinstr` are
called once we've completed entry, and are safe to instrument.
Mark Rutland [Wed, 14 Jul 2021 14:38:41 +0000 (15:38 +0100)]
arm64: mte: fix restoration of GCR_EL1 from suspend
Since commit:
bad1e1c663e0a72f ("arm64: mte: switch GCR_EL1 in kernel entry and exit")
we saved/restored the user GCR_EL1 value at exception boundaries, and
update_gcr_el1_excl() is no longer used for this. However it is used to
restore the kernel's GCR_EL1 value when returning from a suspend state.
Thus, the comment is misleading (and an ISB is necessary).
When restoring the kernel's GCR value, we need an ISB to ensure this is
used by subsequent instructions. We don't necessarily get an ISB by
other means (e.g. if the kernel is built without support for pointer
authentication). As __cpu_setup() initialised GCR_EL1.Exclude to 0xffff,
until a context synchronization event, allocation tag 0 may be used
rather than the desired set of tags.
This patch drops the misleading comment, adds the missing ISB, and for
clarity folds update_gcr_el1_excl() into its only user.
Robin Murphy [Mon, 12 Jul 2021 14:27:46 +0000 (15:27 +0100)]
arm64: Avoid premature usercopy failure
Al reminds us that the usercopy API must only return complete failure
if absolutely nothing could be copied. Currently, if userspace does
something silly like giving us an unaligned pointer to Device memory,
or a size which overruns MTE tag bounds, we may fail to honour that
requirement when faulting on a multi-byte access even though a smaller
access could have succeeded.
Add a mitigation to the fixup routines to fall back to a single-byte
copy if we faulted on a larger access before anything has been written
to the destination, to guarantee making *some* forward progress. We
needn't be too concerned about the overall performance since this should
only occur when callers are doing something a bit dodgy in the first
place. Particularly broken userspace might still be able to trick
generic_perform_write() into an infinite loop by targeting write() at
an mmap() of some read-only device register where the fault-in load
succeeds but any store synchronously aborts such that copy_to_user() is
genuinely unable to make progress, but, well, don't do that...
xen-blkfront has a weird protocol where close message from the remote
side can be delayed, and where hot removals are treated somewhat
differently from regular removals, all leading to potential NULL
pointer removals, and a del_gendisk from the block device release
method, which will deadlock. Fix this by just performing normal hot
removals even when the device is opened like all other Linux block
drivers.
Merge tag 'nvme-5.14-2021-07-15' of git://git.infradead.org/nvme into block-5.14
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 5.14
- fix various races in nvme-pci when shutting down just after probing
(Casey Chen)
- fix a net_device leak in nvme-tcp (Prabhakar Kushwaha)"
* tag 'nvme-5.14-2021-07-15' of git://git.infradead.org/nvme:
nvme-pci: do not call nvme_dev_remove_admin from nvme_remove
nvme-pci: fix multiple races in nvme_setup_io_queues
nvme-tcp: use __dev_get_by_name instead dev_get_by_name for OPT_HOST_IFACE
Rob Herring [Tue, 13 Jul 2021 19:34:53 +0000 (13:34 -0600)]
dt-bindings: More dropping redundant minItems/maxItems
Another round of removing redundant minItems/maxItems from new schema in
the recent merge window.
If a property has an 'items' list, then a 'minItems' or 'maxItems' with the
same size as the list is redundant and can be dropped. Note that is DT
schema specific behavior and not standard json-schema behavior. The tooling
will fixup the final schema adding any unspecified minItems/maxItems.
This condition is partially checked with the meta-schema already, but
only if both 'minItems' and 'maxItems' are equal to the 'items' length.
An improved meta-schema is pending.
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:25 +0000 (12:44 +0200)]
KVM: selftests: smm_test: Test SMM enter from L2
Two additional tests are added:
- SMM triggered from L2 does not currupt L1 host state.
- Save/restore during SMM triggered from L2 does not corrupt guest/host
state.
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:24 +0000 (12:44 +0200)]
KVM: nSVM: Restore nested control upon leaving SMM
If the VM was migrated while in SMM, no nested state was saved/restored,
and therefore svm_leave_smm has to load both save and control area
of the vmcb12. Save area is already loaded from HSAVE area,
so now load the control area as well from the vmcb12.
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:23 +0000 (12:44 +0200)]
KVM: nSVM: Fix L1 state corruption upon return from SMM
VMCB split commit 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the
nested L2 guest") broke return from SMM when we entered there from guest
(L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem
manifests itself like this:
Note: return to L2 succeeded but upon first exit to L1 its RIP points to
'RSM' instruction but we're not in SMM.
The problem appears to be that VMCB01 gets irreversibly destroyed during
SMM execution. Previously, we used to have 'hsave' VMCB where regular
(pre-SMM) L1's state was saved upon nested_svm_vmexit() but now we just
switch to VMCB01 from VMCB02.
Pre-split (working) flow looked like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() restores L1's state from 'hsave'
- SMM -> RSM
- enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have
pre-SMM (and pre L2 VMRUN) L1's state there
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from L1.
This was always broken with regards to svm_get_nested_state()/
svm_set_nested_state(): 'hsave' was never a part of what's being
save and restored so migration happening during SMM triggered from L2 would
never restore L1's state correctly.
Post-split flow (broken) looks like:
- SMM is triggered during L2's execution
- L2's state is pushed to SMRAM
- nested_svm_vmexit() switches to VMCB01 from VMCB02
- SMM -> RSM
- enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01
is already lost.
- L2's state is restored from SMRAM
- upon first exit L1's state is restored from VMCB01 but it is corrupted
(reflects the state during 'RSM' execution).
VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest
and host state so when we switch back to VMCS02 L1's state is intact there.
To resolve the issue we need to save L1's state somewhere. We could've
created a third VMCB for SMM but that would require us to modify saved
state format. L1's architectural HSAVE area (pointed by MSR_VM_HSAVE_PA)
seems appropriate: L0 is free to save any (or none) of L1's state there.
Currently, KVM does 'none'.
Note, for nested state migration to succeed, both source and destination
hypervisors must have the fix. We, however, don't need to create a new
flag indicating the fact that HSAVE area is now populated as migration
during SMM triggered from L2 was always broken.
Fixes: 4995a3685f1b ("KVM: SVM: Use a separate vmcb for the nested L2 guest") Signed-off-by: Vitaly Kuznetsov <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]>
Vitaly Kuznetsov [Mon, 28 Jun 2021 10:44:22 +0000 (12:44 +0200)]
KVM: nSVM: Introduce svm_copy_vmrun_state()
Separate the code setting non-VMLOAD-VMSAVE state from
svm_set_nested_state() into its own function. This is going to be
re-used from svm_enter_smm()/svm_leave_smm().