Git Repo - linux.git/log

drm/amd/display: Enable timing sync on DCN32

Missed enabling timing sync on DCN32 because DCN32 has a different DML
param.

Tested-by: Mark Broadworth <[email protected]>
Reviewed-by: Martin Leung <[email protected]>
Reviewed-by: Jun Lei <[email protected]>
Acked-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Alvin Lee <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

Merge tag 'scmi-fixes-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes

Arm SCMI fixes for v6.1

A bunch of fixes to handle:
1. A possible resource leak in scmi_remove(). The returned error
   value gets ignored by the driver core and can remove the device and
   free the devm-allocated resources. As a simple solution to be able to
   easily backport, the bind attributes in the driver are suppressed as
   there is no need to support it. Additionally the remove path is cleaned
   up by adding device links between the core and the protocol devices
   so that a proper and complete unbinding happens.
2. A possible spin-loop in the SCMI transmit path in case of misbehaving
   platform firmware. A timeout is added to the existing loop so that
   the SCMI stack can bail out, aborting the transmission with warnings.
3. Incorrect handling of the optional Rx channel, fixed by reporting any
   memory errors instead of ignoring them along with other allowed errors.
4. The use of proper device for all the device managed allocations in the
   virtio transport.
5. Incorrect deferred_tx_wq release on the error paths by using devres
   API(devm_add_action_or_reset) to manage the release in the error path.
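
As a rough illustration of item 2 above, a polling loop can be bounded
with a deadline so that it gives up instead of spinning forever. This is
a minimal Python sketch, not the arm_scmi code: wait_until_free(), the
channel_is_free callback and the timeout values are all hypothetical.

```python
import time


def wait_until_free(channel_is_free, timeout_s=0.03, poll_interval_s=0.001):
    """Poll channel_is_free() until it returns True or timeout_s elapses.

    Returns True if the channel became free, False if the deadline passed.
    All names and default values here are hypothetical.
    """
    deadline = time.monotonic() + timeout_s
    while not channel_is_free():
        if time.monotonic() >= deadline:
            # Give up instead of spinning forever on misbehaving firmware.
            return False
        time.sleep(poll_interval_s)
    return True
```

A caller would treat a False return as an aborted transmission and log
a warning rather than retrying indefinitely.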

* tag 'scmi-fixes-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
  firmware: arm_scmi: Fix deferred_tx_wq release on error paths
  firmware: arm_scmi: Fix devres allocation device in virtio transport
  firmware: arm_scmi: Make Rx chan_setup fail on memory errors
  firmware: arm_scmi: Make tx_prepare time out eventually
  firmware: arm_scmi: Suppress the driver's bind attributes
  firmware: arm_scmi: Cleanup the core driver removal callback

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Arnd Bergmann <[email protected]>

drm/amd/display: Set memclk levels to be at least 1 for dcn32

[Why]
Cannot report 0 memclk levels even when SMU does not provide any.

[How]
When the number of memclk levels reported by SMU is 0, set levels to 1.

Tested-by: Mark Broadworth <[email protected]>
Reviewed-by: Martin Leung <[email protected]>
Acked-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Dillon Varone <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected] # 6.0.x

drm/amd/display: Update latencies on DCN321

Update DF related latencies based on new measurements.

Tested-by: Mark Broadworth <[email protected]>
Reviewed-by: Jun Lei <[email protected]>
Acked-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Dillon Varone <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected] # 6.0.x

drm/amd/display: Limit dcn32 to 1950MHz display clock

[why]
Hardware team recommends we limit dispclock to 1950MHz for all DCN3.2.x

[how]
Limit to 1950 when initializing clocks.

Tested-by: Mark Broadworth <[email protected]>
Reviewed-by: Alvin Lee <[email protected]>
Acked-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Jun Lei <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected] # 6.0.x

drm/amd/display: Ignore Cable ID Feature

Ignore cable ID for DP2 receivers that do not support the feature.

Tested-by: Mark Broadworth <[email protected]>
Reviewed-by: Roman Li <[email protected]>
Acked-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Fangzhi Zuo <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>

drm/amd/display: Update DSC capabilities for DCN314

dcn314 has 4 DSCs - the conflicting hardware document was updated and confirmed.

Tested-by: Mark Broadworth <[email protected]>
Reviewed-by: Charlene Liu <[email protected]>
Acked-by: Rodrigo Siqueira <[email protected]>
Signed-off-by: Leo Chen <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
Cc: [email protected] # 6.0.x

Documentation: devres: add missing I2C helper

Add missing devm_i2c_add_adapter() to devres.rst. It's introduced by
commit 07740c92ae57 ("i2c: core: add managed function for adding i2c
adapters").

Fixes: 07740c92ae57 ("i2c: core: add managed function for adding i2c adapters")
Signed-off-by: Yang Yingliang <[email protected]>
Acked-by: Yicong Yang <[email protected]>
Reviewed-by: Andy Shevchenko <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

Merge tag 'parisc-for-6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux

Pull parisc architecture fixes from Helge Deller:
"This mostly handles oddities with the serial port 8250_gsc.c driver.

  Although the name suggests it's just for serial ports on the GSC bus
  (e.g. in older PA-RISC machines), it handles serial ports on PA-RISC
  PCI devices (e.g. on the SuperIO chip) as well.

  Thus this renames the driver to 8250_parisc and fixes the config
  dependencies.

  The other change is a cleanup on how the device IDs of devices in a
  PA-RISC machine are shown at startup"

* tag 'parisc-for-6.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Avoid printing the hardware path twice
  parisc: Export iosapic_serial_irq() symbol for serial port driver
  MAINTAINERS: adjust entry after renaming parisc serial driver
  parisc: Use signed char for hardware path in pdc.h
  parisc/serial: Rename 8250_gsc.c to 8250_parisc.c
  parisc: Make 8250_gsc driver dependend on CONFIG_PARISC

netfilter: ipset: enforce documented limit to prevent allocating huge memory

Daniel Xu reported that the hash:net,iface type of the ipset subsystem does
not limit adding the same network with different interfaces to a set, which
can lead to huge memory usage or allocation failure.

The quick reproducer is

$ ipset create ACL.IN.ALL_PERMIT hash:net,iface hashsize 1048576 timeout 0
$ for i in $(seq 0 100); do /sbin/ipset add ACL.IN.ALL_PERMIT 0.0.0.0/0,kaf_$i timeout 0 -exist; done

The backtrace when vmalloc fails:

        [Tue Oct 25 00:13:08 2022] ipset: vmalloc error: size 1073741848, exceeds total pages
        <...>
        [Tue Oct 25 00:13:08 2022] Call Trace:
        [Tue Oct 25 00:13:08 2022]  <TASK>
        [Tue Oct 25 00:13:08 2022]  dump_stack_lvl+0x48/0x60
        [Tue Oct 25 00:13:08 2022]  warn_alloc+0x155/0x180
        [Tue Oct 25 00:13:08 2022]  __vmalloc_node_range+0x72a/0x760
        [Tue Oct 25 00:13:08 2022]  ? hash_netiface4_add+0x7c0/0xb20
        [Tue Oct 25 00:13:08 2022]  ? __kmalloc_large_node+0x4a/0x90
        [Tue Oct 25 00:13:08 2022]  kvmalloc_node+0xa6/0xd0
        [Tue Oct 25 00:13:08 2022]  ? hash_netiface4_resize+0x99/0x710
        <...>

The fix is to enforce the limit documented in the ipset(8) manpage:

>  The internal restriction of the hash:net,iface set type is that the same
>  network prefix cannot be stored with more than 64 different interfaces
>  in a single set.

Fixes: ccf0a4b7fc68 ("netfilter: ipset: Add bucketsize parameter to all hash types")
Reported-by: Daniel Xu <[email protected]>
Signed-off-by: Jozsef Kadlecsik <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
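
The documented restriction quoted in the commit above can be pictured as
a per-prefix cap on distinct interfaces. The following is only a Python
sketch of that idea, not the ipset implementation: the NetIfaceSet class
and its add() method are hypothetical, and only the limit of 64 comes
from the manpage text.

```python
MAX_IFACES_PER_NET = 64  # limit quoted from the ipset(8) manpage


class NetIfaceSet:
    """Hypothetical model of a hash:net,iface set with the documented cap."""

    def __init__(self):
        self._ifaces_by_net = {}

    def add(self, net, iface):
        """Return True if (net, iface) is stored, False if the cap is hit."""
        ifaces = self._ifaces_by_net.setdefault(net, set())
        if iface in ifaces:
            return True  # already present, nothing new to allocate
        if len(ifaces) >= MAX_IFACES_PER_NET:
            return False  # refuse instead of growing without bound
        ifaces.add(iface)
        return True
```

Replaying the reproducer against this model, the first 64 interface
names for 0.0.0.0/0 are accepted and the remaining ones are rejected.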

Merge tag 'nfs-for-6.1-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:

- Fix some coccicheck warnings

- Avoid memcpy() run-time warning

- Fix up various state reclaim / RECLAIM_COMPLETE errors

- Fix a null pointer dereference in sysfs

- Fix LOCK races

- Fix gss_unwrap_resp_integ() crasher

- Fix zero length clones

- Fix memleak when allocate slot fails

* tag 'nfs-for-6.1-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  nfs4: Fix kmemleak when allocate slot failed
  NFSv4.2: Fixup CLONE dest file size for zero-length count
  SUNRPC: Fix crasher in gss_unwrap_resp_integ()
  NFSv4: Retry LOCK on OLD_STATEID during delegation return
  SUNRPC: Fix null-ptr-deref when xps sysfs alloc failed
  NFSv4.1: We must always send RECLAIM_COMPLETE after a reboot
  NFSv4.1: Handle RECLAIM_COMPLETE trunking errors
  NFSv4: Fix a potential state reclaim deadlock
  NFS: Avoid memcpy() run-time warning for struct sockaddr overflows
  nfs: Remove redundant null checks before kfree

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
"Fix a few more of the usual sorts of bugs:

   - Another regression with source route validation in CMA, introduced
     this merge window

   - Crash in hfi1 due to faulty list operations

   - PCI ID updates for EFA

   - Disable LOCAL_INV in hns because it causes a HW hang

   - Crash in hns due to missing initialization

   - Memory leak in rxe

   - Missing error unwind during ib_core module loading

   - Missing error handling in qedr around work queue creation during
     startup"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/qedr: clean up work queue on failure in qedr_alloc_resources()
  RDMA/core: Fix null-ptr-deref in ib_core_cleanup()
  RDMA/rxe: Fix mr leak in RESPST_ERR_RNR
  RDMA/hns: Fix NULL pointer problem in free_mr_init()
  RDMA/hns: Disable local invalidate operation
  RDMA/efa: Add EFA 0xefa2 PCI ID
  IB/hfi1: Correctly move list in sc_disable()
  RDMA/cma: Use output interface for net_dev check

KVM: VMX: Ignore guest CPUID for host userspace writes to DEBUGCTL

Ignore guest CPUID for host userspace writes to the DEBUGCTL MSR, KVM's
ABI is that setting CPUID vs. state can be done in any order, i.e. KVM
allows userspace to stuff MSRs prior to setting the guest's CPUID that
makes the new MSR "legal".

Keep the vmx_get_perf_capabilities() check for guest writes, even though
it's technically unnecessary since the vCPU's PERF_CAPABILITIES is
consulted when refreshing LBR support. A future patch will clean up
vmx_get_perf_capabilities() to avoid the RDMSR on every call, at which
point the paranoia will incur no meaningful overhead.

Note, prior to vmx_get_perf_capabilities() checking that the host fully
supports LBRs via x86_perf_get_lbr(), KVM effectively relied on
intel_pmu_lbr_is_enabled() to guard against host userspace enabling LBRs
on platforms without full support.

Fixes: c646236344e9 ("KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabled")
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20221006000314 [email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Fold vmx_supported_debugctl() into vcpu_supported_debugctl()

Fold vmx_supported_debugctl() into vcpu_supported_debugctl(), its only
caller. Setting bits only to clear them a few instructions later is
rather silly, and splitting the logic makes things seem more complicated
than they actually are.

Opportunistically drop DEBUGCTLMSR_LBR_MASK now that there's a single
reference to the pair of bits. The extra layer of indirection provides
no meaningful value and makes it unnecessarily tedious to understand
what KVM is doing.

No functional change.

Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20221006000314 [email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

KVM: VMX: Advertise PMU LBRs if and only if perf supports LBRs

Advertise LBR support to userspace via MSR_IA32_PERF_CAPABILITIES if and
only if perf fully supports LBRs. Perf may disable LBRs (by zeroing the
number of LBRs) even on platforms that allegedly support LBRs, e.g. if
probing any LBR MSRs during setup fails.

Fixes: be635e34c284 ("KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES")
Reported-by: Like Xu <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-Id: <20221006000314 [email protected]>
Cc: [email protected]
Signed-off-by: Paolo Bonzini <[email protected]>

btrfs: fix inode reserve space leak due to nowait buffered write

During a nowait buffered write, if we fail to balance dirty pages we exit
btrfs_buffered_write() without releasing the delalloc space reserved for
an extent, resulting in leaking space from the inode's block reserve.

So fix that by releasing the delalloc space for the extent when balancing
dirty pages fails.

Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/all/[email protected]
Fixes: 965f47aeb5de ("btrfs: make btrfs_buffered_write nowait compatible")
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix nowait buffered write returning -ENOSPC

If we are doing a buffered write in NOWAIT context and we can't reserve
metadata space due to -ENOSPC, then we should return -EAGAIN so that we
retry the write in a context allowed to block and do metadata reservation
with flushing, which might succeed this time due to the allowed flushing.

Returning -ENOSPC while in NOWAIT context simply makes some writes fail
with -ENOSPC when they would likely succeed after switching from NOWAIT
context to blocking context. That is unexpected behaviour and even fio
complains about it with a warning like this:

  fio: io_u error on file /mnt/sdi/task_0.0.0: No space left on device: write offset=1535705088, buflen=65536
  fio: pid=592630, err=28/file:io_u.c:1846, func=io_u error, error=No space left on device

The fio job config is this:

   [global]
   bs=64K
   ioengine=io_uring
   iodepth=1
   size=2236962133
   nr_files=1
   filesize=2236962133
   direct=0
   runtime=10
   fallocate=posix
   io_size=2236962133
   group_reporting
   time_based

   [task_0]
   rw=randwrite
   directory=/mnt/sdi
   numjobs=4

So fix this by returning -EAGAIN if we are in NOWAIT context and the
metadata reservation failed with -ENOSPC.
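
The error translation described above amounts to a small mapping. This
is a Python sketch for illustration only, not the btrfs code: the
function name reserve_error_for_caller() and its parameters are
hypothetical.

```python
import errno


def reserve_error_for_caller(ret, nowait):
    """Map a metadata reservation result to what the writer should see.

    ret is 0 or a negative errno value; nowait says whether the write is
    running in NOWAIT context. Both names are hypothetical.
    """
    if nowait and ret == -errno.ENOSPC:
        # Ask the caller to retry in a context that may block and flush.
        return -errno.EAGAIN
    return ret
```

In blocking context -ENOSPC is passed through unchanged, since flushing
was already allowed and the failure is genuine.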

Fixes: 304e45acdb8f ("btrfs: plumb NOWAIT through the write path")
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: remove pointless and double ulist frees in error paths of qgroup tests

Several places in the qgroup self tests follow the pattern of freeing the
ulist pointer they passed to btrfs_find_all_roots() if the call to that
function returned an error. That is pointless because that function always
frees the ulist in case it returns an error.

Also, in some places, like at test_multiple_refs(), after a call to
btrfs_qgroup_account_extent() we also leave "old_roots" and "new_roots"
pointing to ulists that were freed, because btrfs_qgroup_account_extent()
has freed those ulists, and if after that the next call to
btrfs_find_all_roots() fails, we call ulist_free() on the "old_roots"
ulist again, resulting in a double free.

So remove those calls to reduce the code size and avoid double ulist
free in case of an error.

Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix ulist leaks in error paths of qgroup self tests

In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests,
if we fail to add the tree ref, remove the extent item or remove the
extent ref, we are returning from the test function without freeing the
"old_roots" ulist that was allocated by the previous calls to
btrfs_find_all_roots(). Fix that by calling ulist_free() before returning.

Fixes: 442244c96332 ("btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.")
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix inode list leak during backref walking at find_parent_nodes()

During backref walking, at find_parent_nodes(), if we are dealing with a
data extent and we get an error while resolving the indirect backrefs, at
resolve_indirect_refs(), or in the while loop that iterates over the refs
in the direct refs rbtree, we end up leaking the inode lists attached to
the direct refs we have in the direct refs rbtree that were not yet added
to the refs ulist passed as argument to find_parent_nodes(). Since they
were not yet added to the refs ulist and prelim_release() does not free
the lists, on error the caller can only free the lists attached to the
refs that were added to the refs ulist, all the remaining refs get their
inode lists never freed, therefore leaking their memory.

Fix this by having prelim_release() always free any attached inode list
to each ref found in the rbtree, and have find_parent_nodes() set the
ref's inode list to NULL once it transfers ownership of the inode list
to a ref added to the refs ulist passed to find_parent_nodes().

Fixes: 86d5f9944252 ("btrfs: convert prelimary reference tracking to use rbtrees")
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

btrfs: fix inode list leak during backref walking at resolve_indirect_refs()

During backref walking, at resolve_indirect_refs(), if we get an error
we jump to the 'out' label and call ulist_free() on the 'parents' ulist,
which frees all the elements in the ulist - however that does not free
any inode lists that may be attached to elements, through the 'aux' field
of a ulist node, so we end up leaking lists if we have any attached to
the unodes.

Fix this by calling free_leaf_list() instead of ulist_free() when we exit
from resolve_indirect_refs(). The static function free_leaf_list() is
moved up for this to be possible and it's slightly simplified by removing
unnecessary code.

Fixes: 3301958b7c1d ("Btrfs: add inodes before dropping the extent lock in find_all_leafs")
Signed-off-by: Filipe Manana <[email protected]>
Signed-off-by: David Sterba <[email protected]>

Merge branch 'misdn-fixes'

Yang Yingliang says:

====================
two fixes for mISDN

This patchset fixes two issues when device_add() returns error.
====================

Signed-off-by: David S. Miller <[email protected]>

isdn: mISDN: netjet: fix wrong check of device registration

The class is set in mISDN_register_device(), but if device_add() returns
an error, this leads to deleting a device that was never added. Fix this
by using device_is_registered() to check whether the device is registered.

Fixes: a900845e5661 ("mISDN: Add support for Traverse Technologies NETJet PCI cards")
Signed-off-by: Yang Yingliang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mISDN: fix possible memory leak in mISDN_register_device()

After commit 1fa5ae857bb1 ("driver core: get rid of struct device's
bus_id string array"), the name of device is allocated dynamically,
add put_device() to give up the reference, so that the name can be
freed in kobject_cleanup() when the refcount is 0.

Set device class before put_device() to avoid null release() function
WARN message in device_release().

Fixes: 1fa5ae857bb1 ("driver core: get rid of struct device's bus_id string array")
Signed-off-by: Yang Yingliang <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

rose: Fix NULL pointer dereference in rose_send_frame()

The syzkaller reported an issue:

KASAN: null-ptr-deref in range [0x0000000000000380-0x0000000000000387]
CPU: 0 PID: 4069 Comm: kworker/0:15 Not tainted 6.0.0-syzkaller-02734-g0326074ff465 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
Workqueue: rcu_gp srcu_invoke_callbacks
RIP: 0010:rose_send_frame+0x1dd/0x2f0 net/rose/rose_link.c:101
Call Trace:
<IRQ>
rose_transmit_clear_request+0x1d5/0x290 net/rose/rose_link.c:255
rose_rx_call_request+0x4c0/0x1bc0 net/rose/af_rose.c:1009
rose_loopback_timer+0x19e/0x590 net/rose/rose_loopback.c:111
call_timer_fn+0x1a0/0x6b0 kernel/time/timer.c:1474
expire_timers kernel/time/timer.c:1519 [inline]
__run_timers.part.0+0x674/0xa80 kernel/time/timer.c:1790
__run_timers kernel/time/timer.c:1768 [inline]
run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1803
__do_softirq+0x1d0/0x9c8 kernel/softirq.c:571
[...]
</IRQ>

It triggers a NULL pointer dereference when 'neigh->dev->dev_addr' is
accessed in rose_send_frame(). The 'neigh' first appears in
rose_loopback_timer() as 'rose_loopback_neigh', and the 'dev' in
'rose_loopback_neigh' is initialized as NULL.

This was once fixed by commit 3b3fd068c56e3fbea30090859216a368398e39bf
("rose: Fix Null pointer dereference in rose_send_frame()"), but it was
reintroduced by commit 3c53cd65dece47dd1f9d3a809f32e59d1d87b2b8
("rose: check NULL rose_loopback_neigh->loopback").

Fix it by adding a NULL check in rose_transmit_clear_request(). When
the 'dev' in 'neigh' is NULL, we don't reply to the request and just
clear it.

syzkaller didn't provide a repro, so I provide a syz repro like:
r0 = syz_init_net_socket$bt_sco(0x1f, 0x5, 0x2)
ioctl$sock_inet_SIOCSIFFLAGS(r0, 0x8914, &(0x7f0000000180)={'rose0\x00', 0x201})
r1 = syz_init_net_socket$rose(0xb, 0x5, 0x0)
bind$rose(r1, &(0x7f00000000c0)=@full={0xb, @dev, @null, 0x0, [@null, @null, @netrom, @netrom, @default, @null]}, 0x40)
connect$rose(r1, &(0x7f0000000240)=@short={0xb, @dev={0xbb, 0xbb, 0xbb, 0x1, 0x0}, @remote={0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0xcc, 0x1}, 0x1, @netrom={0xbb, 0xbb, 0xbb, 0xbb, 0xbb, 0x0, 0x0}}, 0x1c)

Fixes: 3c53cd65dece ("rose: check NULL rose_loopback_neigh->loopback")
Signed-off-by: Zhang Qilong <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

perf/x86/intel: Add Cooper Lake stepping to isolation_ucodes[]

The intel_pebs_isolation quirk checks both model number and stepping.
Cooper Lake has a different stepping (11) than the other Skylake Xeon.
It cannot benefit from the optimization in commit 9b545c04abd4f
("perf/x86/kvm: Avoid unnecessary work in guest filtering").

Add the stepping of Cooper Lake into the isolation_ucodes[] table.

Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

perf/x86/intel: Fix pebs event constraints for SPR

According to the latest event list, update the MEM_INST_RETIRED events
which support the DataLA facility for SPR.

Fixes: 61b985e3e775 ("perf/x86/intel: Add perf core PMU support for Sapphire Rapids")
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

perf/x86/intel: Fix pebs event constraints for ICL

According to the latest event list, update the MEM_INST_RETIRED events
which support the DataLA facility.

Fixes: 6017608936c1 ("perf/x86/intel: Add Icelake support")
Reported-by: Jannis Klinkenberg <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

perf/x86/rapl: Use standard Energy Unit for SPR Dram RAPL domain

Intel Xeon servers used to use a fixed energy resolution (15.3uJ) for
Dram RAPL domain. But on SPR, Dram RAPL domain follows the standard
energy resolution as described in MSR_RAPL_POWER_UNIT.

Remove the SPR Dram energy unit quirk.

Fixes: bcfd218b6679 ("perf/x86/rapl: Add support for Intel SPR platform")
Signed-off-by: Zhang Rui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Kan Liang <[email protected]>
Tested-by: Wang Wendy <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

perf/hw_breakpoint: test: Skip the test if dependencies unmet

Running the test currently fails on non-SMP systems, despite being
enabled by default. This means that running the test with:

./tools/testing/kunit/kunit.py run --arch x86_64 hw_breakpoint

results in every hw_breakpoint test failing with:

# test_one_cpu: failed to initialize: -22
not ok 1 - test_one_cpu

Instead, use kunit_skip(), which will mark the test as skipped, and give
a more comprehensible message:

ok 1 - test_one_cpu # SKIP not enough cpus

This makes it more obvious that the test is not suited to the test
environment, and so wasn't run, rather than having run and failed.

Signed-off-by: David Gow <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Daniel Latypov <[email protected]>
Acked-by: Marco Elver <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

netfilter: nf_nat: Fix possible memory leak in nf_nat_init()

In nf_nat_init(), register_nf_nat_bpf() can fail and return directly
without any error handling.
Then nf_nat_bysource will leak and registering of &nat_net_ops,
&follow_master_nat and nf_nat_hook won't be reverted.

This leaves wild ops in linked lists, and when another module tries to
call register_pernet_operations() or nf_ct_helper_expectfn_register()
it triggers a page fault:

BUG: unable to handle page fault for address: fffffbfff81b964c
RIP: 0010:register_pernet_operations+0x1b9/0x5f0
Call Trace:
<TASK>
  register_pernet_subsys+0x29/0x40
  ebtables_init+0x58/0x1000 [ebtables]
  ...
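
The missing error handling described above follows the usual pattern of
undoing completed registrations in reverse order when a later step
fails. This is a generic Python sketch of that pattern, not the
nf_nat_init() code: init_with_unwind() and the (register, unregister)
pairs are hypothetical.

```python
def init_with_unwind(steps):
    """Run each register callable in order.

    steps is a list of (register, unregister) pairs. If a register call
    raises, every step that already succeeded is undone in reverse order
    and the original exception is re-raised.
    """
    undo = []
    try:
        for register, unregister in steps:
            register()
            undo.append(unregister)
    except Exception:
        for unregister in reversed(undo):
            unregister()
        raise
```

With this shape, a failure in the last step leaves nothing registered,
so later users of the same lists never see stale entries.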

Fixes: 820dc0523e05 ("net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c")
Signed-off-by: Chen Zhongjin <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

selftests/pidfd_test: Remove the erroneous ','

Remove the erroneous ',', otherwise it might result in wrong output
and report:
...
Bail out! (errno %d)
test: Unexpected epoll_wait result (c=4208480, events=2)
...

Fixes: 740378dc7834 ("pidfd: add polling selftests")
Signed-off-by: Zhao Gongyi <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

ipvs: fix WARNING in ip_vs_app_net_cleanup()

During the initialization of ip_vs_app_net_init(), if file ip_vs_app
fails to be created, the initialization is successful by default.
Therefore, the ip_vs_app file cannot be found during removal in
ip_vs_app_net_cleanup(), which causes a WARNING.

The following is the stack information:
name 'ip_vs_app'
WARNING: CPU: 1 PID: 9 at fs/proc/generic.c:712 remove_proc_entry+0x389/0x460
Modules linked in:
Workqueue: netns cleanup_net
RIP: 0010:remove_proc_entry+0x389/0x460
Call Trace:
<TASK>
ops_exit_list+0x125/0x170
cleanup_net+0x4ea/0xb00
process_one_work+0x9bf/0x1710
worker_thread+0x665/0x1080
kthread+0x2e4/0x3a0
ret_from_fork+0x1f/0x30
</TASK>

Fixes: 457c4cbc5a3d ("[NET]: Make /proc/net per network namespace")
Signed-off-by: Zhengchao Shao <[email protected]>
Acked-by: Julian Anastasov <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

ipvs: fix WARNING in __ip_vs_cleanup_batch()

During the initialization of ip_vs_conn_net_init(), if file ip_vs_conn
or ip_vs_conn_sync fails to be created, the initialization is successful
by default. Therefore, the ip_vs_conn or ip_vs_conn_sync file cannot
be found during removal.

The following is the stack information:
name 'ip_vs_conn_sync'
WARNING: CPU: 3 PID: 9 at fs/proc/generic.c:712
remove_proc_entry+0x389/0x460
Modules linked in:
Workqueue: netns cleanup_net
RIP: 0010:remove_proc_entry+0x389/0x460
Call Trace:
<TASK>
__ip_vs_cleanup_batch+0x7d/0x120
ops_exit_list+0x125/0x170
cleanup_net+0x4ea/0xb00
process_one_work+0x9bf/0x1710
worker_thread+0x665/0x1080
kthread+0x2e4/0x3a0
ret_from_fork+0x1f/0x30
</TASK>

Fixes: 61b1ab4583e2 ("IPVS: netns, add basic init per netns.")
Signed-off-by: Zhengchao Shao <[email protected]>
Acked-by: Julian Anastasov <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

ipvs: use explicitly signed chars

The `char` type with no explicit sign is sometimes signed and sometimes
unsigned. This code will break on platforms such as arm, where char is
unsigned. So mark it here as explicitly signed, so that the
todrop_counter decrement and subsequent comparison is correct.
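
The difference can be seen by decrementing a counter that holds 0 under
both interpretations. This is a small Python sketch using ctypes to
model C's 8-bit char types; the helper names are hypothetical and the
code is not taken from ipvs.

```python
import ctypes


def decrement_as_signed_char(value):
    """Decrement value and store the result in a signed 8-bit char."""
    return ctypes.c_byte(value - 1).value


def decrement_as_unsigned_char(value):
    """Decrement value and store the result in an unsigned 8-bit char."""
    return ctypes.c_ubyte(value - 1).value
```

Starting from 0, the signed variant yields a negative number while the
unsigned variant wraps around to a large positive one, so a comparison
against zero after the decrement gives opposite answers.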

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Acked-by: Julian Anastasov <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

kconfig: fix segmentation fault in menuconfig search

Since commit d05377e184fc ("kconfig: Create links to main menu items
in search"), menuconfig shows a jump key next to "Main menu" if the
nearest visible parent is the rootmenu. If you press that jump key,
menuconfig crashes with a segmentation fault.

For example, do this:

$ make ARCH=arm64 allnoconfig menuconfig

Press '/' to search for the string "ACPI". Press '1' to choose
"(1) Main menu". Then, menuconfig crashes with a segmentation fault.

The following code in search_conf()

conf(targets[i]->parent, targets[i]);

results in NULL pointer dereference because targets[i] is the rootmenu,
which does not have a parent.

Commit d05377e184fc tried to fix the issue of top-level items not having
a jump key, but adding the "Main menu" was not the right fix.

The correct fix is to show the searched item itself. This fixes another
weird behavior described in the comment block.

Fixes: d05377e184fc ("kconfig: Create links to main menu items in search")
Reported-by: Johannes Zink <[email protected]>
Signed-off-by: Masahiro Yamada <[email protected]>
Tested-by: Bagas Sanjaya <[email protected]>
Tested-by: Johannes Zink <[email protected]>

netlink: introduce bigendian integer types

Jakub reported that the addition of the "network_byte_order"
member in struct nla_policy increases its size on 32-bit platforms.

Instead of scraping the bit from elsewhere Johannes suggested
to add explicit NLA_BE types instead, so do this here.

NLA_POLICY_MAX_BE() macro is removed again, there is no need
for it: NLA_POLICY_MAX(NLA_BE.., ..) will do the right thing.

NLA_BE64 can be added later.
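
To see why a range check needs to know the byte order of an attribute,
consider a 16-bit payload sent in network byte order. This is a Python
sketch for illustration only, not the netlink validation code: both
helper names are hypothetical.

```python
def be16_value(payload):
    """Interpret a 2-byte payload as a big-endian (network order) integer."""
    if len(payload) != 2:
        raise ValueError("expected exactly 2 bytes")
    return int.from_bytes(payload, "big")


def be16_within_max(payload, max_value):
    """Range-check a big-endian 16-bit payload against max_value."""
    return be16_value(payload) <= max_value
```

The payload b"\x00\x10" encodes 16 in network byte order, but reading
the same bytes as little-endian gives 4096, so a maximum of 255 would
be enforced wrongly without an explicit big-endian type.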

Fixes: 08724ef69907 ("netlink: introduce NLA_POLICY_MAX_BE")
Reported-by: Jakub Kicinski <[email protected]>
Suggested-by: Johannes Berg <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: lan966x: Fix unmapping of received frames using FDMA

When lan966x received a frame, it built the skb and then called
dma_unmap_single() with the frame size as the length. This has two
issues:
1. It uses one length to map and a different length to unmap.
2. When the unmap happens, the data is synced for the CPU, which could
overwrite what build_skb initialized.

The fix for both problems is to change the order of operations: first
sync the frame for the CPU, then build the skb, and finally unmap using
the correct size without syncing the frame for the CPU again.

Fixes: c8349639324a ("net: lan966x: Add FDMA functionality")
Signed-off-by: Horatiu Vultur <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'net-lan966x-fixes-for-when-mtu-is-changed'

Horatiu Vultur says:

====================
net: lan966x: Fixes for when MTU is changed

There were multiple problems in different parts of the driver when
the MTU was changed.
The first problem was that the HW was not configured with the correct
value: ETH_HLEN and ETH_FCS_LEN were missing. The second problem was
that when vlan filtering was enabled/disabled, the MRU was not adjusted
correctly. The last issue was that the FDMA calculated the maximum MTU
incorrectly.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: lan966x: Fix FDMA when MTU is changed

When the MTU is changed, FDMA must calculate the maximum size of the
frame that it can receive, so that it can determine the page order
needed to allocate for the received frames.
The first problem was that the max MTU was calculated from the value
in dev and not from the HW, so it was missing the L2 header and the
FCS.
The other problem was that once the skb is created using
__build_skb_around, it reserves some space for skb_shared_info.
So if we received a frame whose size is at the limit of the page order,
creating the skb would fail because there would not be enough space
for all the data.
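
As a rough illustration of the sizing described above, the sketch below
computes a page order for a receive buffer that has to hold the largest
frame plus the tail space reserved for the skb metadata. All names and
constants here (FDMA_SKETCH_PAGE_SIZE, FDMA_SKETCH_SHINFO_SIZE,
fdma_sketch_rx_page_order()) are placeholders chosen for the example and
are not taken from the driver.

```c
#include <stddef.h>

/* Illustrative values only; the real ones depend on platform and config. */
#define FDMA_SKETCH_PAGE_SIZE   4096u
#define FDMA_SKETCH_SHINFO_SIZE 320u	/* stand-in for the skb_shared_info space */
#define FDMA_SKETCH_ETH_HLEN    14u	/* L2 header */
#define FDMA_SKETCH_ETH_FCS_LEN 4u	/* frame check sequence */

/* Smallest order such that (page size << order) >= size. */
static unsigned int fdma_sketch_get_order(size_t size)
{
	unsigned int order = 0;

	while (((size_t)FDMA_SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Page order for an RX buffer holding a full frame plus the skb tail data. */
static unsigned int fdma_sketch_rx_page_order(unsigned int mtu)
{
	size_t frame = (size_t)mtu + FDMA_SKETCH_ETH_HLEN +
		       FDMA_SKETCH_ETH_FCS_LEN;

	return fdma_sketch_get_order(frame + FDMA_SKETCH_SHINFO_SIZE);
}
```

With these placeholder numbers, a frame that would just fit in one page
on its own needs the next order once the reserved tail space is added.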

Fixes: 2ea1cbac267e ("net: lan966x: Update FDMA to change MTU.")
Signed-off-by: Horatiu Vultur <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>

net: lan966x: Adjust maximum frame size when vlan is enabled/disabled

When vlan filtering is enabled/disabled, the maximum frame size that
can be received must be adjusted. When vlan filtering is enabled, the
HW is allowed to receive 4 extra bytes, which are the vlan tag.
So the maximum frame size would be 1522 with a vlan tag. If vlan
filtering is disabled then the maximum frame size would be 1518
regardless of whether there is a vlan tag.
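
The arithmetic behind the 1518 and 1522 figures can be sketched as
follows, assuming the default MTU of 1500. The helper name and the
VLANLEN_SKETCH_* constants are made up for this example; they only
mirror the usual sizes of the L2 header, the FCS and a single vlan tag.

```c
#include <stdbool.h>

#define VLANLEN_SKETCH_ETH_HLEN    14u	/* L2 header */
#define VLANLEN_SKETCH_ETH_FCS_LEN 4u	/* frame check sequence */
#define VLANLEN_SKETCH_VLAN_HLEN   4u	/* one vlan tag */

/* Maximum frame size to accept for a given MTU and vlan filtering state. */
static unsigned int vlanlen_sketch_max_rx_frame(unsigned int mtu,
						bool vlan_filtering)
{
	unsigned int len = mtu + VLANLEN_SKETCH_ETH_HLEN +
			   VLANLEN_SKETCH_ETH_FCS_LEN;

	/* Room for the tag is only added while vlan filtering is enabled. */
	if (vlan_filtering)
		len += VLANLEN_SKETCH_VLAN_HLEN;
	return len;
}
```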

Fixes: 6d2c186afa5d ("net: lan966x: Add vlan support.")
Signed-off-by: Horatiu Vultur <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>

net: lan966x: Fix the MTU calculation

When the MTU was changed, the lan966x didn't take into consideration
the L2 header and the FCS. So the HW was configured with a smaller
value than what was desired. Therefore the correct value to configure
the HW would be new_mtu + ETH_HLEN + ETH_FCS_LEN.
The vlan tag is not considered here, because at the time when the
blamed commit was added, there was no vlan filtering support. The
vlan fix will be part of the next patch.

Fixes: d28d6d2e37d1 ("net: lan966x: add port module support")
Signed-off-by: Horatiu Vultur <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>

x86/tdx: Panic on bad configs that #VE on "private" memory access

All normal kernel memory is "TDX private memory".  This includes
everything from kernel stacks to kernel text.  Handling
exceptions on arbitrary accesses to kernel memory is essentially
impossible because they can happen in horribly nasty places like
kernel entry/exit.  But, TDX hardware can theoretically _deliver_
a virtualization exception (#VE) on any access to private memory.

But, it's not as bad as it sounds.  TDX can be configured to never
deliver these exceptions on private memory with a "TD attribute"
called ATTR_SEPT_VE_DISABLE.  The guest has no way to *set* this
attribute, but it can check it.

Ensure ATTR_SEPT_VE_DISABLE is set in early boot.  panic() if it
is unset.  There is no sane way for Linux to run with this
attribute clear so a panic() is appropriate.
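
A minimal sketch of the check, written as plain C so it can be read
outside the kernel. The bit position used for the attribute and the
helper name are assumptions made for this example, and the function
returns a flag where the kernel would call panic().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position, for illustration only. */
#define TDX_SKETCH_ATTR_SEPT_VE_DISABLE (UINT64_C(1) << 28)

/*
 * Returns true when boot may continue, false when the attribute is clear
 * and the kernel would panic().
 */
static bool tdx_sketch_td_attr_ok(uint64_t td_attr)
{
	return (td_attr & TDX_SKETCH_ATTR_SEPT_VE_DISABLE) != 0;
}
```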

There is a small window during boot before the check where the kernel
has an early #VE handler. But the handler is only for port I/O
and will also panic() as soon as it sees any other #VE, such as
one generated by a private memory access.

[ dhansen: Rewrite changelog and rebase on new tdx_parse_tdinfo().
   Add Kirill's tested-by because I made changes since
   he wrote this. ]

Fixes: 9a22bf6debbf ("x86/traps: Add #VE support for TDX guest")
Reported-by: [email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Tested-by: Kirill A. Shutemov <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/all/20221028141220.29217-3-kirill.shutemov%40linux.intel.com

cxl/region: Fix decoder allocation crash

When an intermediate port's decoders have been exhausted by existing
regions, and creating a new region with the port in question in its
hierarchical path is attempted, cxl_port_attach_region() fails to find a
port decoder (as would be expected), and drops into the failure / cleanup
path.

However, during cleanup of the region reference, a sanity check attempts
to dereference the decoder, which in the above case didn't exist. This
causes a NULL pointer dereference BUG.

To fix this, refactor the decoder allocation and de-allocation into
helper routines, and in this 'free' routine, check that the decoder,
@cxld, is valid before attempting any operations on it.
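
The shape of such a 'free' helper can be sketched as below. The struct
and function names are hypothetical stand-ins, not the CXL core's types;
the point is only that the helper returns early when no decoder was ever
allocated.

```c
#include <stddef.h>

struct cxl_sketch_decoder { int region_id; };	/* stand-in for a port decoder */
struct cxl_sketch_region_ref { struct cxl_sketch_decoder *cxld; };

/*
 * Release the decoder held by a region reference. Returns 1 if a decoder
 * was released and 0 if there was none to release.
 */
static int cxl_sketch_free_decoder(struct cxl_sketch_region_ref *ref)
{
	struct cxl_sketch_decoder *cxld = ref->cxld;

	if (!cxld)
		return 0;	/* allocation failed earlier, nothing to undo */

	cxld->region_id = -1;	/* detach the decoder from the region */
	ref->cxld = NULL;
	return 1;
}
```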

Cc: <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
Reviewed-by: Dave Jiang <[email protected]>
Fixes: 384e624bb211 ("cxl/region: Attach endpoint decoders")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dan Williams <[email protected]>

Merge tag 'docs-6.1-fixes' of git://git.lwn.net/linux

Pull documentation fixes from Jonathan Corbet:
"Four small fixes for the docs tree"

* tag 'docs-6.1-fixes' of git://git.lwn.net/linux:
  docs/process/howto: Replace C89 with C11
  Documentation: Fix spelling mistake in hacking.rst
  Documentation: process: replace outdated LTS table w/ link
  tracing/histogram: Update document for KEYS_MAX size

Merge tag 'nfsd-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

- Fix a loop that occurs when using multiple net namespaces

* tag 'nfsd-6.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix net-namespace logic in __nfsd_file_cache_purge

nfsd: fix net-namespace logic in __nfsd_file_cache_purge

If the namespace doesn't match the one in "net", then we'll continue,
but that doesn't cause another rhashtable_walk_next call, so it will
loop infinitely.
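
The failure mode is easy to reproduce in any walk where `continue`
skips the step that fetches the next element. The sketch below shows
one way to structure such a loop so that every iteration advances. The
walker is a hypothetical array-backed stand-in for
rhashtable_walk_next(), and the actual patch may arrange the loop
differently.

```c
#include <stddef.h>

struct nfsd_sketch_file { int net_id; int purged; };

/* Hypothetical walker: returns the next element, or NULL at the end. */
struct nfsd_sketch_walk {
	struct nfsd_sketch_file *items;
	size_t n, pos;
};

static struct nfsd_sketch_file *nfsd_sketch_walk_next(struct nfsd_sketch_walk *w)
{
	return w->pos < w->n ? &w->items[w->pos++] : NULL;
}

/* Purge entries that belong to net_id; returns how many were purged. */
static int nfsd_sketch_purge(struct nfsd_sketch_walk *w, int net_id)
{
	struct nfsd_sketch_file *f;
	int purged = 0;

	/*
	 * The next element is fetched in the loop condition, so `continue`
	 * can never re-test the same element.
	 */
	while ((f = nfsd_sketch_walk_next(w)) != NULL) {
		if (f->net_id != net_id)
			continue;
		f->purged = 1;
		purged++;
	}
	return purged;
}
```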

Fixes: ce502f81ba88 ("NFSD: Convert the filecache to use rhashtable")
Reported-by: Petr Vorel <[email protected]>
Link: https://lore.kernel.org/ltp/Y1%2FP8gDAcWC%2F+VR3@pevik/
Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>

Merge tag 'nolibc-urgent.2022.10.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull nolibc fixes from Paul McKenney:
"This contains a couple of fixes for string-function bugs"

* tag 'nolibc-urgent.2022.10.28a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  tools/nolibc/string: Fix memcmp() implementation
  tools/nolibc: Fix missing strlen() definition and infinite loop with gcc-12

arm64: booting: Document our requirements for fine grained traps with SME

With SME we require that fine grained traps on access to TPIDR2_EL0 and
SMPRI_EL1 are disabled but did not document that fact. Add the relevant
register bits.

Signed-off-by: Mark Brown <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"x86:

   - fix lock initialization race in gfn-to-pfn cache (+selftests)

   - fix two refcounting errors

   - emulator fixes

   - mask off reserved bits in CPUID

   - fix bug with disabling SGX

  RISC-V:

   - update MAINTAINERS"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86/xen: Fix eventfd error handling in kvm_xen_eventfd_assign()
  KVM: x86: smm: number of GPRs in the SMRAM image depends on the image format
  KVM: x86: emulator: update the emulation mode after CR0 write
  KVM: x86: emulator: update the emulation mode after rsm
  KVM: x86: emulator: introduce emulator_recalc_and_set_mode
  KVM: x86: emulator: em_sysexit should update ctxt->mode
  KVM: selftests: Mark "guest_saw_irq" as volatile in xen_shinfo_test
  KVM: selftests: Add tests in xen_shinfo_test to detect lock races
  KVM: Reject attempts to consume or refresh inactive gfn_to_pfn_cache
  KVM: Initialize gfn_to_pfn_cache locks in dedicated helper
  KVM: VMX: fully disable SGX if SECONDARY_EXEC_ENCLS_EXITING unavailable
  KVM: x86: Exempt pending triple fault from event injection sanity check
  MAINTAINERS: git://github -> https://github.com for kvm-riscv
  KVM: debugfs: Return retval of simple_attr_open() if it fails
  KVM: x86: Reduce refcount if single_open() fails in kvm_mmu_rmaps_stat_open()
  KVM: x86: Mask off reserved bits in CPUID.8000001FH
  KVM: x86: Mask off reserved bits in CPUID.8000001AH
  KVM: x86: Mask off reserved bits in CPUID.80000008H
  KVM: x86: Mask off reserved bits in CPUID.80000006H
  KVM: x86: Mask off reserved bits in CPUID.80000001H

Merge tag 'linux-watchdog-6.1-rc4' of git://www.linux-watchdog.org/linux-watchdog

Pull watchdog fixes from Wim Van Sebroeck:

- fix use after free in exar driver

- spelling fix in comment

* tag 'linux-watchdog-6.1-rc4' of git://www.linux-watchdog.org/linux-watchdog:
  drivers: watchdog: exar_wdt.c fix use after free
  watchdog: sp805_wdt: fix spelling typo in comment

arm64: entry: avoid kprobe recursion

The cortex_a76_erratum_1463225_debug_handler() function is called when
handling debug exceptions (and synchronous exceptions from BRK
instructions), and so is called when a probed function executes. If the
compiler does not inline cortex_a76_erratum_1463225_debug_handler(), it
can be probed.

If cortex_a76_erratum_1463225_debug_handler() is probed, any debug
exception or software breakpoint exception will result in recursive
exceptions leading to a stack overflow. This can be triggered with the
ftrace multiple_probes selftest, and as per the example splat below.

This is a regression caused by commit:

  6459b8469753e9fe ("arm64: entry: consolidate Cortex-A76 erratum 1463225 workaround")

... which removed the NOKPROBE_SYMBOL() annotation associated with the
function.

My intent was that cortex_a76_erratum_1463225_debug_handler() would be
inlined into its caller, el1_dbg(), which is marked noinstr and cannot
be probed. Mark cortex_a76_erratum_1463225_debug_handler() as
__always_inline to ensure this.

Example splat prior to this patch (with recursive entries elided):

| # echo p cortex_a76_erratum_1463225_debug_handler > /sys/kernel/debug/tracing/kprobe_events
| # echo p do_el0_svc >> /sys/kernel/debug/tracing/kprobe_events
| # echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
| Insufficient stack space to handle exception!
| ESR: 0x0000000096000047 -- DABT (current EL)
| FAR: 0xffff800009cefff0
| Task stack:     [0xffff800009cf0000..0xffff800009cf4000]
| IRQ stack:      [0xffff800008000000..0xffff800008004000]
| Overflow stack: [0xffff00007fbc00f0..0xffff00007fbc10f0]
| CPU: 0 PID: 145 Comm: sh Not tainted 6.0.0 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 604003c5 (nZCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arm64_enter_el1_dbg+0x4/0x20
| lr : el1_dbg+0x24/0x5c
| sp : ffff800009cf0000
| x29: ffff800009cf0000 x28: ffff000002c74740 x27: 0000000000000000
| x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
| x23: 00000000604003c5 x22: ffff80000801745c x21: 0000aaaac95ac068
| x20: 00000000f2000004 x19: ffff800009cf0040 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000010 x10: ffff800008c87190 x9 : ffff800008ca00d0
| x8 : 000000000000003c x7 : 0000000000000000 x6 : 0000000000000000
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : 00000000000043a4
| x2 : 00000000f2000004 x1 : 00000000f2000004 x0 : ffff800009cf0040
| Kernel panic - not syncing: kernel stack overflow
| CPU: 0 PID: 145 Comm: sh Not tainted 6.0.0 #2
| Hardware name: linux,dummy-virt (DT)
| Call trace:
|  dump_backtrace+0xe4/0x104
|  show_stack+0x18/0x4c
|  dump_stack_lvl+0x64/0x7c
|  dump_stack+0x18/0x38
|  panic+0x14c/0x338
|  test_taint+0x0/0x2c
|  panic_bad_stack+0x104/0x118
|  handle_bad_stack+0x34/0x48
|  __bad_stack+0x78/0x7c
|  arm64_enter_el1_dbg+0x4/0x20
|  el1h_64_sync_handler+0x40/0x98
|  el1h_64_sync+0x64/0x68
|  cortex_a76_erratum_1463225_debug_handler+0x0/0x34
...
|  el1h_64_sync_handler+0x40/0x98
|  el1h_64_sync+0x64/0x68
|  cortex_a76_erratum_1463225_debug_handler+0x0/0x34
...
|  el1h_64_sync_handler+0x40/0x98
|  el1h_64_sync+0x64/0x68
|  cortex_a76_erratum_1463225_debug_handler+0x0/0x34
|  el1h_64_sync_handler+0x40/0x98
|  el1h_64_sync+0x64/0x68
|  do_el0_svc+0x0/0x28
|  el0t_64_sync_handler+0x84/0xf0
|  el0t_64_sync+0x18c/0x190
| Kernel Offset: disabled
| CPU features: 0x0080,00005021,19001080
| Memory Limit: none
| ---[ end Kernel panic - not syncing: kernel stack overflow ]---

With this patch, cortex_a76_erratum_1463225_debug_handler() is inlined
into el1_dbg(), and el1_dbg() cannot be probed:

| # echo p cortex_a76_erratum_1463225_debug_handler > /sys/kernel/debug/tracing/kprobe_events
| sh: write error: No such file or directory
| # grep -w cortex_a76_erratum_1463225_debug_handler /proc/kallsyms | wc -l
| 0
| # echo p el1_dbg > /sys/kernel/debug/tracing/kprobe_events
| sh: write error: Invalid argument
| # grep -w el1_dbg /proc/kallsyms | wc -l
| 1

Fixes: 6459b8469753 ("arm64: entry: consolidate Cortex-A76 erratum 1463225 workaround")
Cc: <[email protected]> # 5.12.x
Signed-off-by: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>

x86/tdx: Prepare for using "INFO" call for a second purpose

The TDG.VP.INFO TDCALL provides the guest with various details about
the TDX system that the guest needs to run. Only one field is currently
used: 'gpa_width' which tells the guest which PTE bits mark pages shared
or private.

A second field is now needed: the guest "TD attributes" to tell if
virtualization exceptions are configured in a way that can harm the guest.

Make the naming and calling convention more generic and distinct from the
mask-centric one.

Thanks to Sathya for the inspiration here, but there's no code, comments
or changelogs left from where he started.

Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Tested-by: Kirill A. Shutemov <[email protected]>
Cc: [email protected]

Merge tag 'refcount-cow-domain-6.1_2022-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.1-fixesA

xfs: improve runtime refcountbt corruption detection

Fuzz testing of the refcount btree demonstrated a weakness in validation
of refcount btree records during normal runtime.  The idea of using the
upper bit of the rc_startblock field to separate the refcount records
into one group for shared space and another for CoW staging extents was
added at the last minute.  The incore struct left this bit encoded in
the upper bit of the startblock field, which makes it all too easy for
arithmetic operations to overflow if we don't detect the cowflag
properly.

When I ran a norepair fuzz tester, I was able to crash the kernel on one
of these accidental overflows by fuzzing a key record in a node block,
which broke lookups.  To fix the problem, make the domain (shared/cow) a
separate field in the incore record.
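
A minimal sketch of the idea, assuming the flag lives in bit 31 of a
32-bit startblock as described above. The type and function names are
invented for this example and do not match the XFS structures.

```c
#include <stdint.h>

#define REFC_SKETCH_COWFLAG (UINT32_C(1) << 31)	/* upper bit of the startblock */

enum refc_sketch_domain { REFC_SKETCH_SHARED, REFC_SKETCH_COW };

struct refc_sketch_irec {
	uint32_t startblock;		/* block number with the flag stripped */
	uint32_t blockcount;
	enum refc_sketch_domain domain;	/* kept apart from startblock */
};

/* Decode an on-disk startblock into a record with an explicit domain. */
static struct refc_sketch_irec refc_sketch_decode(uint32_t disk_startblock,
						  uint32_t blockcount)
{
	struct refc_sketch_irec irec;

	irec.domain = (disk_startblock & REFC_SKETCH_COWFLAG) ?
			REFC_SKETCH_COW : REFC_SKETCH_SHARED;
	irec.startblock = disk_startblock & ~REFC_SKETCH_COWFLAG;
	irec.blockcount = blockcount;
	return irec;
}

/* End of the extent, computed on the flag-free block number. */
static uint64_t refc_sketch_extent_end(const struct refc_sketch_irec *irec)
{
	return (uint64_t)irec->startblock + irec->blockcount;
}
```

Once the flag is stripped at decode time, block arithmetic no longer has
to remember to mask it out first.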

Unfortunately, a customer also hit this once in production.  Due to bugs
in the kernel running on the VM host, writes to the disk image would
occasionally be lost.  Given sufficient memory pressure on the VM guest,
a refcountbt xfs_buf could be reclaimed and later reloaded from the
stale copy on the virtual disk.  The stale disk contents were a refcount
btree leaf block full of records for the wrong domain, and this caused
an infinite loop in the guest VM.

v2: actually include the refcount adjust loop invariant checking patch;
    move the deferred refcount continuation checks earlier in the series;
    break up the megapatch into smaller pieces; fix an uninitialized list
    error.
v3: in the continuation check patch, verify the per-ag extent before
    converting it to a fsblock

Signed-off-by: Darrick J. Wong <[email protected]>
* tag 'refcount-cow-domain-6.1_2022-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: rename XFS_REFC_COW_START to _COWFLAG
  xfs: fix uninitialized list head in struct xfs_refcount_recovery
  xfs: fix agblocks check in the cow leftover recovery function
  xfs: check record domain when accessing refcount records
  xfs: remove XFS_FIND_RCEXT_SHARED and _COW
  xfs: refactor domain and refcount checking
  xfs: report refcount domain in tracepoints
  xfs: track cow/shared record domains explicitly in xfs_refcount_irec
  xfs: refactor refcount record usage in xchk_refcountbt_rec
  xfs: move _irec structs to xfs_types.h
  xfs: check deferred refcount op continuation parameters
  xfs: create a predicate to verify per-AG extents
  xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents

sfc: Fix an error handling path in efx_pci_probe()

If an error occurs after the first kzalloc() the corresponding memory
allocation is never freed.

Add the missing kfree() in the error handling path, as already done in the
remove() function.
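
The pattern can be sketched in userspace C, with calloc()/free()
standing in for kzalloc()/kfree(). The function names, the fail_late
switch and the allocation counter are invented for this example.

```c
#include <stdlib.h>

struct sfc_sketch_nic { int id; };

static int sfc_sketch_live_allocs;	/* outstanding allocations, demo only */

/* fail_late simulates an error that occurs after the first allocation. */
static int sfc_sketch_probe(int fail_late, struct sfc_sketch_nic **out)
{
	struct sfc_sketch_nic *nic = calloc(1, sizeof(*nic));
	int rc;

	if (!nic)
		return -1;
	sfc_sketch_live_allocs++;

	if (fail_late) {
		rc = -2;
		goto err_free;	/* without this path the allocation leaks */
	}

	*out = nic;
	return 0;

err_free:
	free(nic);
	sfc_sketch_live_allocs--;
	return rc;
}

/* Mirrors the release already done on the normal removal path. */
static void sfc_sketch_remove(struct sfc_sketch_nic *nic)
{
	free(nic);
	sfc_sketch_live_allocs--;
}
```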

Fixes: 7e773594dada ("sfc: Separate efx_nic memory from net_device memory")
Signed-off-by: Christophe JAILLET <[email protected]>
Acked-by: Martin Habets <[email protected]>
Link: https://lore.kernel.org/r/dc114193121c52c8fa3779e49bdd99d4b41344a9.1667077009.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jakub Kicinski <[email protected]>

KVM: arm64: Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE

The trapping of SMPRI_EL1 and TPIDR2_EL0 currently only really
works on nVHE, as only this mode uses the fine-grained trapping
that controls these two registers.

Move the trapping enable/disable code into
__{de,}activate_traps_common(), allowing it to be called when it
actually matters on VHE, and remove the flipping of EL2 control
for TPIDR2_EL0, which only affects the host access of this
register.

Fixes: 861262ab8627 ("KVM: arm64: Handle SME host state when running guests")
Reported-by: Mark Brown <[email protected]>
Reviewed-by: Mark Brown <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]

drm/imx: imx-tve: Fix return type of imx_tve_connector_mode_valid

The mode_valid field in drm_connector_helper_funcs is expected to be of
type:
enum drm_mode_status (* mode_valid) (struct drm_connector *connector,
struct drm_display_mode *mode);

The mismatched return type breaks forward edge kCFI since the underlying
function definition does not match the function hook definition.

The return type of imx_tve_connector_mode_valid should be changed from
int to enum drm_mode_status.

Reported-by: Dan Carpenter <[email protected]>
Link: https://github.com/ClangBuiltLinux/linux/issues/1703
Cc: [email protected]
Signed-off-by: Nathan Huckleberry <[email protected]>
Reviewed-by: Nathan Chancellor <[email protected]>
Reviewed-by: Fabio Estevam <[email protected]>
Reviewed-by: Philipp Zabel <[email protected]>
Signed-off-by: Philipp Zabel <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

drm/imx: Kconfig: Remove duplicated 'select DRM_KMS_HELPER' line

A duplicated line 'select DRM_KMS_HELPER' was introduced in Kconfig file
by commit 09717af7d13d ("drm: Remove CONFIG_DRM_KMS_CMA_HELPER option"),
so remove it.

Fixes: 09717af7d13d ("drm: Remove CONFIG_DRM_KMS_CMA_HELPER option")
Signed-off-by: Liu Ying <[email protected]>
Reviewed-by: Philipp Zabel <[email protected]>
Signed-off-by: Philipp Zabel <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]

i2c: i801: add lis3lv02d's I2C address for Vostro 5568

Dell Vostro 5568 laptop has lis3lv02d, but its i2c address is not known
to the kernel. Add this address.

Output of "cat /sys/devices/platform/lis3lv02d/position" on Dell Vostro
5568 laptop:
    - Horizontal: (-18,0,1044)
    - Front elevated: (522,-18,1080)
    - Left elevated: (-18,-360,1080)
    - Upside down: (36,108,-1134)

Signed-off-by: Nam Cao <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Reviewed-by: Pali Rohár <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

i2c: tegra: Allocate DMA memory for DMA engine

When the I2C controllers are running in DMA mode, it is the DMA engine
that performs the memory accesses rather than the I2C controller. Pass
the DMA engine's struct device pointer to the DMA API to make sure the
correct DMA operations are used.

This fixes an issue where the DMA engine's SMMU stream ID needs to be
misleadingly set for the I2C controllers in device tree.

Suggested-by: Robin Murphy <[email protected]>
Signed-off-by: Thierry Reding <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

i2c: piix4: Fix adapter not be removed in piix4_remove()

In piix4_probe(), the piix4 adapter will be registered in:

   piix4_probe()
     piix4_add_adapters_sb800() / piix4_add_adapter()
       i2c_add_adapter()

Based on the probed device type, piix4_add_adapters_sb800() or a single
piix4_add_adapter() will be called.
In the former case, piix4_adapter_count is set to the number of adapters,
while in the latter case it is not set and keeps its default of *zero*.

When piix4 is removed, piix4_remove() removes the adapters added in
piix4_probe() based on the piix4_adapter_count value.
Because the count is zero in the single adapter case, the adapter is not
removed and the resources allocated for it are leaked, such as
the i2c client and device.

These resources can still be accessed by i2c or the bus and cause problems.
An easily reproduced case is that when a new adapter is registered, i2c
will get the leaked adapter and try to call smbus_algorithm, which was
already freed:

Triggered by: rmmod i2c_piix4 && modprobe max31730

BUG: unable to handle page fault for address: ffffffffc053d860
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 3752 Comm: modprobe Tainted: G
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
RIP: 0010:i2c_default_probe (drivers/i2c/i2c-core-base.c:2259) i2c_core
RSP: 0018:ffff888107477710 EFLAGS: 00000246
...
<TASK>
  i2c_detect (drivers/i2c/i2c-core-base.c:2302) i2c_core
  __process_new_driver (drivers/i2c/i2c-core-base.c:1336) i2c_core
  bus_for_each_dev (drivers/base/bus.c:301)
  i2c_for_each_dev (drivers/i2c/i2c-core-base.c:1823) i2c_core
  i2c_register_driver (drivers/i2c/i2c-core-base.c:1861) i2c_core
  do_one_initcall (init/main.c:1296)
  do_init_module (kernel/module/main.c:2455)
  ...
</TASK>
---[ end trace 0000000000000000 ]---

Fix this problem by correctly setting piix4_adapter_count to 1 for the
single adapter so that it can be removed normally.
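
The dependency between the count and the removal loop can be modelled
in a few lines. Everything below is a hypothetical stand-in for the
driver state, kept only to show that removal driven by a count needs the
count to be set on every probe path.

```c
#include <stdbool.h>

#define PIIX_SKETCH_MAX_ADAPTERS 4

static bool piix_sketch_registered[PIIX_SKETCH_MAX_ADAPTERS];
static int piix_sketch_adapter_count;

/* Register either several adapters or a single one. */
static void piix_sketch_probe(bool multi_port)
{
	int n = multi_port ? PIIX_SKETCH_MAX_ADAPTERS : 1;

	for (int i = 0; i < n; i++)
		piix_sketch_registered[i] = true;
	piix_sketch_adapter_count = n;	/* set on both paths, including n == 1 */
}

/* Removal is driven by the count, so a count of zero removes nothing. */
static void piix_sketch_remove(void)
{
	while (piix_sketch_adapter_count > 0)
		piix_sketch_registered[--piix_sketch_adapter_count] = false;
}
```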

Fixes: 528d53a1592b ("i2c: piix4: Fix probing of reserved ports on AMD Family 16h Model 30h")
Signed-off-by: Chen Zhongjin <[email protected]>
Reviewed-by: Jean Delvare <[email protected]>
Signed-off-by: Wolfram Sang <[email protected]>

arm64: dts: juno: Add thermal critical trip points

When thermal zones are defined, trip point definitions are mandatory.
Define a couple of critical trip points for monitoring of existing
PMIC and SOC thermal zones.

This was lost in the txt to yaml conversion and was enforced again recently
via the commit 8c596324232d ("dt-bindings: thermal: Fix missing required property")

Cc: Rob Herring <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: [email protected]
Signed-off-by: Cristian Marussi <[email protected]>
Fixes: f7b636a8d83c ("arm64: dts: juno: add thermal zones for scpi sensors")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sudeep Holla <[email protected]>

firmware: arm_scmi: Fix deferred_tx_wq release on error paths

Use devres to allocate the dedicated deferred_tx_wq polling workqueue so
as to automatically trigger the proper resource release on error path.

Reported-by: Dan Carpenter <[email protected]>
Fixes: 5a3b7185c47c ("firmware: arm_scmi: Add atomic mode support to virtio transport")
Signed-off-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sudeep Holla <[email protected]>

firmware: arm_scmi: Fix devres allocation device in virtio transport

SCMI virtio transport device managed allocations must use the main
platform device in devres operations instead of the channel devices.

Cc: Peter Hilber <[email protected]>
Fixes: 46abe13b5e3d ("firmware: arm_scmi: Add virtio transport")
Signed-off-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sudeep Holla <[email protected]>

firmware: arm_scmi: Make Rx chan_setup fail on memory errors

SCMI Rx channels are optional and their setup can fail when they are not
present, but channel setup routines must still bail out on memory errors.

Make channels setup, and related probing, fail when memory errors are
reported on Rx channels.

Fixes: 5c8a47a5a91d ("firmware: arm_scmi: Make scmi core independent of the transport type")
Signed-off-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sudeep Holla <[email protected]>

firmware: arm_scmi: Make tx_prepare time out eventually

SCMI transports based on shared memory, at start of transmissions, have
to wait for the shared Tx channel area to be eventually freed by the
SCMI platform before accessing the channel. In fact the channel is owned
by the SCMI platform until marked as free by the platform itself and,
as such, cannot be used by the agent until relinquished.

As a consequence a badly misbehaving SCMI platform firmware could lock
the channel indefinitely and make the kernel side SCMI stack loop
forever waiting for such channel to be freed, possibly hanging the
whole boot sequence.

Add a timeout to the existing Tx waiting spin-loop so that, when the
system ends up in this situation, the SCMI stack can at least bail out,
noisily warn the user, and abort the transmission.
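
A bounded wait of this kind can be sketched as below. The channel
struct, the error value and the function name are assumptions made for
the example. To keep it deterministic the bound is a number of polls,
whereas a real implementation would bound the wait by elapsed time.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical view of the shared Tx area: the platform sets `free` when
 * it hands the channel back to the agent.
 */
struct scmi_sketch_chan { volatile bool free; };

#define SCMI_SKETCH_ETIMEDOUT (-110)

/* Spin until the channel is free or max_spins polls have been made. */
static int scmi_sketch_tx_prepare(struct scmi_sketch_chan *ch,
				  uint32_t max_spins)
{
	for (uint32_t i = 0; i < max_spins; i++) {
		if (ch->free) {
			ch->free = false;	/* claim it for this transmission */
			return 0;
		}
	}
	return SCMI_SKETCH_ETIMEDOUT;	/* bail out instead of looping forever */
}
```

The caller is then expected to warn and abort the transmission when the
timeout value is returned.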

Reported-by: YaxiongTian <[email protected]>
Suggested-by: YaxiongTian <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Etienne Carriere <[email protected]>
Cc: Florian Fainelli <[email protected]>
Signed-off-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sudeep Holla <[email protected]>

firmware: arm_scmi: Suppress the driver's bind attributes

Suppress the capability to unbind the core SCMI driver since all the
SCMI stack protocol drivers depend on it.

Fixes: aa4f886f3893 ("firmware: arm_scmi: add basic driver infrastructure for SCMI")
Signed-off-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sudeep Holla <[email protected]>

firmware: arm_scmi: Cleanup the core driver removal callback

Platform driver .remove callbacks are not supposed to fail and report
errors. Such errors are indeed ignored by the platform driver core
and the driver unbind process is completed anyway.

The SCMI core platform driver as it is now, instead, bails out reporting
an error in case of an explicit unbind request.

Fix the removal path by adding proper device links between the core SCMI
device and the SCMI protocol devices so that a full SCMI stack unbind is
triggered when the core driver is removed. The remove process does not
bail out anymore on the anomalous conditions triggered by an explicit
unbind but the user is still warned.

Reported-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Cristian Marussi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sudeep Holla <[email protected]>

MAINTAINERS: Update HiSilicon LPC BUS Driver maintainer

Add Jay Fang as the maintainer of the HiSilicon LPC BUS Driver, replacing
John Garry.

Signed-off-by: Jay Fang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]'
Signed-off-by: Arnd Bergmann <[email protected]>

ARM: dts: ux500: Add trips to battery thermal zones

Recent changes to the thermal framework have made the trip
points (trips) for thermal zones compulsory, which made
the Ux500 DTS files break validation and also stopped
probing because of similar changes to the code.

Fix this by adding an "outer bounding box": battery thermal
zones should not get warmer than 70 degrees, then we will
shut down.

Fixes: 8c596324232d ("dt-bindings: thermal: Fix missing required property")
Fixes: 3fd6d6e2b4e8 ("thermal/of: Rework the thermal device tree initialization")
Signed-off-by: Linus Walleij <[email protected]>
Cc: Daniel Lezcano <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]'
Signed-off-by: Arnd Bergmann <[email protected]>

Merge tag 'imx-fixes-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into arm/fixes

i.MX fixes for 6.1:

- Fix imx93-pd driver to release resources when error occurs in probe.
- A series from Ioana Ciornei to add missing clock frequencies for MDIO
  controllers on LayerScape SoCs, so that the kernel driver can work
  independently from bootloader.
- A series from Li Jun to fix USB power domain setup in i.MX8MM/N device
  trees.
- Fix CPLD_Dn pull configuration for MX8Menlo board to avoid interfering
  with CPLD power off functionality.
- Fix ctrl_sleep_moci GPIO setup for verdin-imx8mp board.
- Fix DT schema check warnings on uSDHC clocks for imx8-ss-conn device
  tree.
- Fix up gpcv2 DT bindings to have an optional `power-domains` property.
- A couple of i.MX93 device tree fixes on S4MU interrupt and gpio-ranges
  of GPIO controllers.
- Keep PU regulator on for Quad and QuadPlus based imx6dl-yapp4 boards to
  work around a hardware design flaw in supply voltage distribution.
- Fix user push-button GPIO offset on imx6qdl-gw59 boards.

* tag 'imx-fixes-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
  arm64: dts: ls208xa: specify clock frequencies for the MDIO controllers
  arm64: dts: ls1088a: specify clock frequencies for the MDIO controllers
  arm64: dts: lx2160a: specify clock frequencies for the MDIO controllers
  soc: imx: imx93-pd: Fix the error handling path of imx93_pd_probe()
  arm64: dts: imx93: correct gpio-ranges
  arm64: dts: imx93: correct s4mu interrupt names
  dt-bindings: power: gpcv2: add power-domains property
  arm64: dts: imx8: correct clock order
  ARM: dts: imx6dl-yapp4: Do not allow PM to switch PU regulator off on Q/QP
  ARM: dts: imx6qdl-gw59{10,13}: fix user pushbutton GPIO offset
  arm64: dts: imx8mn: Correct the usb power domain
  arm64: dts: imx8mn: remove otg1 power domain dependency on hsio
  arm64: dts: imx8mm: correct usb power domains
  arm64: dts: imx8mm: remove otg1/2 power domain dependency on hsio
  arm64: dts: verdin-imx8mp: fix ctrl_sleep_moci
  arm64: dts: imx8mm: Enable CPLD_Dn pull down resistor on MX8Menlo

Link: https://lore.kernel.org/r/20221101031547.GB125525@dragon
Signed-off-by: Arnd Bergmann <[email protected]>

netfilter: nf_tables: release flow rule object from commit path

There is no need to postpone this to the commit release path: no packets
walk over this object, it is accessed from the control plane only.
This helped uncover a UAF triggered by races with the netlink notifier.

Fixes: 9dd732e0bdf5 ("netfilter: nf_tables: memleak flow rule from commit path")
Reported-by: [email protected]
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: netlink notifier might race to release objects

The commit release path is invoked via call_rcu and it runs lockless to
release the objects after the rcu grace period. The netlink notifier
handler might win the race to remove objects that the transaction context
is still referencing from the commit release path.

Call rcu_barrier() to ensure pending rcu callbacks run to completion
if the list of transactions to be destroyed is not empty.

Fixes: 6001a930ce03 ("netfilter: nftables: introduce table ownership")
Reported-by: [email protected]
Signed-off-by: Pablo Neira Ayuso <[email protected]>

powerpc/32: Select ARCH_SPLIT_ARG64

On 32-bit kernels, 64-bit syscall arguments are split into two
registers. For that to work with syscall wrappers, the prototype of the
syscall must have the argument split so that the wrapper macro properly
unpacks the arguments from pt_regs.

The fanotify_mark() syscall is one such syscall, which already has a
split prototype, guarded behind ARCH_SPLIT_ARG64.

So select ARCH_SPLIT_ARG64 to get that prototype and fix fanotify_mark()
on 32-bit kernels with syscall wrappers.

Note also that fanotify_mark() is the only usage of ARCH_SPLIT_ARG64.
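
The splitting described above amounts to the following arithmetic. The
helper names are invented for this example, and which register carries
which half is decided by the ABI, which this sketch does not model.

```c
#include <stdint.h>

/* Split a 64-bit value into the two 32-bit halves a 32-bit caller passes. */
static void arg64_sketch_split(uint64_t val, uint32_t *hi, uint32_t *lo)
{
	*hi = (uint32_t)(val >> 32);
	*lo = (uint32_t)val;
}

/* Rebuild the 64-bit argument from the two 32-bit register values. */
static uint64_t arg64_sketch_join(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}
```

A prototype that declares a single 64-bit argument would instead read it
from one register slot and lose the other half, which is why the split
prototype is needed.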

Fixes: 7e92e01b7245 ("powerpc: Provide syscall wrapper")
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

net: tun: fix bugs for oversize packet when napi frags enabled

Recently, we got two syzkaller reports caused by oversize packets when
napi frags is enabled.

The first problem occurs because the first segment of the iov_iter from
user space is very large: 2147479538 bytes, which is bigger than the
threshold for bailing out early in __alloc_pages(). Because
skb->pfmemalloc is true, __kmalloc_reserve() uses the pfmemalloc
reserves without the __GFP_NOWARN flag, so we get the following warning:

========================================================
WARNING: CPU: 1 PID: 17965 at mm/page_alloc.c:5295 __alloc_pages+0x1308/0x16c4 mm/page_alloc.c:5295
...
Call trace:
__alloc_pages+0x1308/0x16c4 mm/page_alloc.c:5295
__alloc_pages_node include/linux/gfp.h:550 [inline]
alloc_pages_node include/linux/gfp.h:564 [inline]
kmalloc_large_node+0x94/0x350 mm/slub.c:4038
__kmalloc_node_track_caller+0x620/0x8e4 mm/slub.c:4545
__kmalloc_reserve.constprop.0+0x1e4/0x2b0 net/core/skbuff.c:151
pskb_expand_head+0x130/0x8b0 net/core/skbuff.c:1654
__skb_grow include/linux/skbuff.h:2779 [inline]
tun_napi_alloc_frags+0x144/0x610 drivers/net/tun.c:1477
tun_get_user+0x31c/0x2010 drivers/net/tun.c:1835
tun_chr_write_iter+0x98/0x100 drivers/net/tun.c:2036

The other problem is caused by odd IPv6 packets that have no
NEXTHDR_NONE extension header and a large packet length, 2127925, which
is bigger than ETH_MAX_MTU (65535). After ipv6_gso_pull_exthdrs() in
ipv6_gro_receive(), the network_header and transport_header offsets are
both bigger than U16_MAX. That overflows skb->network_header and
skb->transport_header, because both are of type '__u16'. Eventually this
affects the value passed to __skb_push(skb, value) and makes it very
large. After __skb_push() in ipv6_gro_receive(), skb->data is less than
skb->head, which is an out-of-bounds memory bug.
That triggers the following problem:

==================================================================
BUG: KASAN: use-after-free in eth_type_trans+0x100/0x260
...
Call trace:
dump_backtrace+0xd8/0x130
show_stack+0x1c/0x50
dump_stack_lvl+0x64/0x7c
print_address_description.constprop.0+0xbc/0x2e8
print_report+0x100/0x1e4
kasan_report+0x80/0x120
__asan_load8+0x78/0xa0
eth_type_trans+0x100/0x260
napi_gro_frags+0x164/0x550
tun_get_user+0xda4/0x1270
tun_chr_write_iter+0x74/0x130
do_iter_readv_writev+0x130/0x1ec
do_iter_write+0xbc/0x1e0
vfs_writev+0x13c/0x26c

To fix the problems, restrict the packet size to less than
(ETH_MAX_MTU - NET_SKB_PAD - NET_IP_ALIGN), which accounts for the skb
space reserved in napi_alloc_skb(), because transport_header is an offset
from skb->head. Simply add a length check in tun_napi_alloc_frags().
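
The bound and the 16-bit truncation can be sketched in userspace as
follows. The DEMO_ constants are assumed values (the kernel derives
NET_SKB_PAD and NET_IP_ALIGN per architecture), the helper names are
hypothetical, and whether the real check treats the limit as inclusive
is also an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only; the kernel derives NET_SKB_PAD
 * and NET_IP_ALIGN per architecture and configuration. */
#define DEMO_ETH_MAX_MTU   0xFFFFU
#define DEMO_NET_SKB_PAD   64U
#define DEMO_NET_IP_ALIGN  2U
#define DEMO_FRAGS_MAX_LEN \
	(DEMO_ETH_MAX_MTU - DEMO_NET_SKB_PAD - DEMO_NET_IP_ALIGN)

/* Hypothetical helper: accept a frags write only if header offsets
 * measured from skb->head can still fit in 16 bits. */
bool demo_frags_len_ok(size_t len)
{
	return len <= DEMO_FRAGS_MAX_LEN;
}

/* What a __u16 offset field keeps when handed a larger offset. */
uint16_t demo_offset_as_u16(size_t offset)
{
	return (uint16_t)offset;
}
```

Both sizes quoted in the reports fail demo_frags_len_ok(), and an offset
of 2127925 stored in a 16-bit field wraps to 30773.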

Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Ziyang Xuan <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ibmvnic: change maintainers for vnic driver

Changed maintainers for vnic driver, since Dany has new responsibilities.
Also added Nick Child as reviewer.

Signed-off-by: Rick Lindsley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

block: blk_add_rq_to_plug(): clear stale 'last' after flush

blk_mq_flush_plug_list() empties ->mq_list, so the request we'd peeked
there before that call is gone; in any case, we are not dealing with a
mix of requests for different queues now - there are no requests left in
the plug.

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>

powerpc/32: fix syscall wrappers with 64-bit arguments

With the introduction of syscall wrappers all wrappers for syscalls with
64-bit arguments must be handled specially, not only those that have
unaligned 64-bit arguments. This left out the fallocate() and
sync_file_range2() syscalls.

Fixes: 7e92e01b7245 ("powerpc: Provide syscall wrapper")
Fixes: e23750623835 ("powerpc/32: fix syscall wrappers with 64-bit arguments of unaligned register-pairs")
Signed-off-by: Andreas Schwab <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

asm-generic: compat: fix compat_arg_u64() and compat_arg_u64_dual()

The macros are defined backwards.

This affects the following compat syscalls:
- compat_sys_truncate64()
- compat_sys_ftruncate64()
- compat_sys_fallocate()
- compat_sys_sync_file_range()
- compat_sys_fadvise64_64()
- compat_sys_readahead()
- compat_sys_pread64()
- compat_sys_pwrite64()
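
A minimal sketch of what "backwards" means here, assuming the macros
declare the two 32-bit halves of a 64-bit argument in an order that
depends on endianness. The demo_ names and the DEMO_BIG_ENDIAN switch are
hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdint.h>

/* Declare the two halves in register order: low half first on
 * little-endian, high half first on big-endian. Swapping the two
 * branches silently exchanges the halves of every affected argument. */
#ifdef DEMO_BIG_ENDIAN
#define demo_arg_u64(name)	uint32_t name##_hi, uint32_t name##_lo
#else
#define demo_arg_u64(name)	uint32_t name##_lo, uint32_t name##_hi
#endif

/* Reassemble the 64-bit value from its halves. */
#define demo_arg_u64_glue(name) \
	(((uint64_t)name##_hi << 32) | (uint64_t)name##_lo)

/* Hypothetical compat entry point taking a split 64-bit length. */
uint64_t demo_truncate64(demo_arg_u64(length))
{
	return demo_arg_u64_glue(length);
}
```

Built without DEMO_BIG_ENDIAN, demo_truncate64(0x2, 0x1) yields
0x100000002; with the branches swapped the same call would yield
0x200000001.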

Fixes: 43d5de2b67d7 ("asm-generic: compat: Support BE for long long args in 32-bit ABIs")
Signed-off-by: Andreas Schwab <[email protected]>
[mpe: Add list of affected syscalls]
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Merge tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
"A few more fixes and regression fixes:

   - fix a corner case when handling tree-mod-log changes in reallocated
     nodes

   - fix crash on raid0 filesystems created with <5.4 mkfs.btrfs that
     could lead to division by zero

   - add missing super block checksum verification after thawing
     filesystem

   - handle one more case in send when dealing with orphan files

   - fix parameter type mismatch for generation when reading dentry

   - improved error handling in raid56 code

   - better struct bio packing after recent cleanups"

* tag 'for-6.1-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: don't use btrfs_chunk::sub_stripes from disk
  btrfs: fix type of parameter generation in btrfs_get_dentry
  btrfs: send: fix send failure of a subcase of orphan inodes
  btrfs: make thaw time super block check to also verify checksum
  btrfs: fix tree mod log mishandling of reallocated nodes
  btrfs: reorder btrfs_bio for better packing
  btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed
  btrfs: raid56: properly handle the error when unable to find the missing stripe

Merge tag 'lsm-pr-20221031' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm

Pull LSM fix from Paul Moore:
"A single patch to the capabilities code to fix a potential memory leak
in the xattr allocation error handling"

* tag 'lsm-pr-20221031' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm:
  capabilities: fix potential memleak on error path from vfs_getxattr_alloc()

KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling them

There are two capabilities related to ring-based dirty page tracking:
KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Both are
supported by x86. However, arm64 supports KVM_CAP_DIRTY_LOG_RING_ACQ_REL
only when the feature is supported on arm64. Userspace is not limited to
enabling capabilities that were advertised, meaning
KVM_CAP_DIRTY_LOG_RING can be enabled on arm64 by userspace, which is
wrong.

Fix it by checking that the capability has been advertised prior to
enabling it. Enabling a capability that hasn't been advertised is
rejected.
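
The check can be sketched in userspace like this. The demo_vm structure,
the DEMO_CAP_ values and the -EINVAL return code are assumptions made for
illustration; they do not reproduce the KVM ioctl path.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical capability numbers. */
enum {
	DEMO_CAP_RING,
	DEMO_CAP_RING_ACQ_REL,
	DEMO_CAP_COUNT
};

/* Hypothetical per-VM state: what is advertised and what is enabled. */
struct demo_vm {
	bool advertised[DEMO_CAP_COUNT];
	bool enabled[DEMO_CAP_COUNT];
};

/* Enable a capability only if it was advertised first. */
int demo_enable_cap(struct demo_vm *vm, unsigned int cap)
{
	if (cap >= DEMO_CAP_COUNT || !vm->advertised[cap])
		return -EINVAL;	/* assumed error code */

	vm->enabled[cap] = true;
	return 0;
}
```

On a VM that advertises only DEMO_CAP_RING_ACQ_REL, a request for
DEMO_CAP_RING fails and leaves the enabled state untouched.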

Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option")
Reported-by: Sean Christopherson <[email protected]>
Suggested-by: Sean Christopherson <[email protected]>
Signed-off-by: Gavin Shan <[email protected]>
Reviewed-by: Oliver Upton <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

Merge tag 'fix-log-recovery-misuse-6.1_2022-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.1-fixes

xfs: fix various problems with log intent item recovery

Starting with 6.1-rc1, CONFIG_FORTIFY_SOURCE checks became smart enough
to detect memcpy() callers that copy beyond what seems to be the end of
a struct.  Unfortunately, gcc has a bug wherein it cannot reliably
compute the size of a struct containing another struct containing a flex
array at the end.  This is the case with the xfs log item format
structures, which means that -rc1 starts complaining all over the place.

Fix these problems by memcpying the struct head and the flex arrays
separately.  Although it's tempting to use the FLEX_ARRAY macros, the
structs involved are part of the ondisk log format.  Some day we're
going to want to make the ondisk log contents endian-safe, which means
that we will have to stop using memcpy entirely.

While we're at it, fix some deficiencies in the validation of recovered
log intent items -- if the size of the recovery buffer is not even large
enough to cover the flex array record count in the head, we should abort
the recovery of that item immediately.

The last patch of this series changes the EFI/EFD sizeof functions names
and behaviors to be consistent with the similarly named sizeof helpers
for other log intent items.

v2: fix more inadequate log intent done recovery validation and dump
    corrupt recovered items

Signed-off-by: Darrick J. Wong <[email protected]>

* tag 'fix-log-recovery-misuse-6.1_2022-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: dump corrupt recovered log intent items to dmesg consistently
  xfs: actually abort log recovery on corrupt intent-done log items
  xfs: refactor all the EFI/EFD log item sizeof logic
  xfs: fix memcpy fortify errors in EFI log format copying
  xfs: fix memcpy fortify errors in RUI log format copying
  xfs: fix memcpy fortify errors in CUI log format copying
  xfs: fix memcpy fortify errors in BUI log format copying
  xfs: fix validation in attr log item recovery

xfs: rename XFS_REFC_COW_START to _COWFLAG

We've been (ab)using XFS_REFC_COW_START as both an integer quantity and
a bit flag, even though it's *only* a bit flag. Rename the variable to
reflect its nature and update the cast target since we're not supposed
to be comparing it to xfs_agblock_t now.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: fix uninitialized list head in struct xfs_refcount_recovery

We're supposed to initialize the list head of an object before adding it
to another list. Fix that, and stop using the kmem_{alloc,free} calls
from the Irix days.

Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: fix agblocks check in the cow leftover recovery function

As we've seen, refcount records use the upper bit of the rc_startblock
field to ensure that all the refcount records are at the right side of
the refcount btree. This works because an AG is never allowed to have
more than (1U << 31) blocks in it. If we ever encounter a filesystem
claiming to have that many blocks, we absolutely do not want reflink
touching it at all.

However, this test at the start of xfs_refcount_recover_cow_leftovers is
slightly incorrect -- it /should/ be checking that agblocks isn't larger
than the XFS_MAX_CRC_AG_BLOCKS constant, and it should check that the
constant is never large enough to conflict with that CoW flag.

Note that the V5 superblock verifier has not historically rejected
filesystems where agblocks >= XFS_MAX_CRC_AG_BLOCKS, which is why this
ended up in the COW recovery routine.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: check record domain when accessing refcount records

Now that we've separated the startblock and CoW/shared extent domain in
the incore refcount record structure, check the domain whenever we
retrieve a record to ensure that it's still in the domain that we want.
Depending on the circumstances, a change in domain either means we're
done processing or that we've found a corruption and need to fail out.

The refcount check in xchk_xref_is_cow_staging is redundant since
_get_rec has done that for a long time now, so we can get rid of it.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: remove XFS_FIND_RCEXT_SHARED and _COW

Now that we have an explicit enum for shared and CoW staging extents, we
can get rid of the old FIND_RCEXT flags. Omit a couple of conversions
that disappear in the next patches.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: refactor domain and refcount checking

Create a helper function to ensure that CoW staging extent records have
a single refcount and that shared extent records have more than 1
refcount. We'll put this to more use in the next patch.
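
A minimal sketch of such a helper, using a hypothetical demo_domain enum
and function name rather than the names in the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the refcount record domain. */
enum demo_domain {
	DEMO_DOMAIN_SHARED,
	DEMO_DOMAIN_COW
};

/* CoW staging records must have exactly one reference; shared records
 * must have more than one. */
bool demo_refcount_matches_domain(enum demo_domain domain, uint32_t refcount)
{
	if (domain == DEMO_DOMAIN_COW)
		return refcount == 1;
	return refcount > 1;
}
```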

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: report refcount domain in tracepoints

Now that we've broken out the startblock and shared/cow domain in the
incore refcount extent record structure, update the tracepoints to
report the domain.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: track cow/shared record domains explicitly in xfs_refcount_irec

Just prior to committing the reflink code into upstream, the xfs
maintainer at the time requested that I find a way to shard the refcount
records into two domains -- one for records tracking shared extents, and
a second for tracking CoW staging extents.  The idea here was to
minimize mount time CoW reclamation by pushing all the CoW records to
the right edge of the keyspace, and it was accomplished by setting the
upper bit in rc_startblock.  We don't allow AGs to have more than 2^31
blocks, so the bit was free.

Unfortunately, this was a very late addition to the codebase, so most of
the refcount record processing code still treats rc_startblock as a u32
and pays no attention to whether or not the upper bit (the cow flag) is
set.  This weakness is theoretically exploitable, since we're not
fully validating the incoming metadata records.

Fuzzing demonstrates practical exploits of this weakness.  If the cow
flag of a node block key record is corrupted, a lookup operation can go
to the wrong record block and start returning records from the wrong
cow/shared domain.  This causes the math to go all wrong (since cow
domain is still implicit in the upper bit of rc_startblock) and we can
crash the kernel by tricking xfs into jumping into a nonexistent AG and
tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL.

To fix this, start tracking the domain as an explicit part of struct
xfs_refcount_irec, adjust all refcount functions to check the domain
of a returned record, and alter the function definitions to accept them
where necessary.
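
The idea can be sketched as below. The demo_irec layout and function
names are hypothetical, and the sketch ignores the big-endian ondisk
encoding; it only shows the flag being separated from the block number
when decoding and folded back in when encoding.

```c
#include <stdint.h>

/* Upper bit of the ondisk start block marks a CoW staging record. */
#define DEMO_COWFLAG	(1U << 31)

enum demo_domain {
	DEMO_DOMAIN_SHARED,
	DEMO_DOMAIN_COW
};

/* Hypothetical incore record: block number and domain kept apart. */
struct demo_irec {
	uint32_t startblock;	/* never carries the flag bit */
	enum demo_domain domain;
};

/* Split the ondisk value into block number and explicit domain. */
void demo_decode_startblock(uint32_t ondisk, struct demo_irec *irec)
{
	irec->domain = (ondisk & DEMO_COWFLAG) ? DEMO_DOMAIN_COW
					       : DEMO_DOMAIN_SHARED;
	irec->startblock = ondisk & ~DEMO_COWFLAG;
}

/* Rebuild the ondisk value from the incore record. */
uint32_t demo_encode_startblock(const struct demo_irec *irec)
{
	uint32_t ondisk = irec->startblock & ~DEMO_COWFLAG;

	if (irec->domain == DEMO_DOMAIN_COW)
		ondisk |= DEMO_COWFLAG;
	return ondisk;
}
```

Callers can then compare irec->domain against the domain they asked for
instead of doing arithmetic on a start block that may still carry the
flag.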

Found by fuzzing keys[2].cowflag = add in xfs/464.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: refactor refcount record usage in xchk_refcountbt_rec

Consolidate the open-coded xfs_refcount_irec fields into an actual
struct and use the existing _btrec_to_irec to decode the ondisk record.
This will reduce code churn in the next patch.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: dump corrupt recovered log intent items to dmesg consistently

If log recovery decides that an intent item is corrupt and wants to
abort the mount, capture a hexdump of the corrupt log item in the kernel
log for further analysis. Some of the log item code already did this,
so we're fixing the rest to do it consistently.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: move _irec structs to xfs_types.h

Structure definitions for incore objects do not belong in the ondisk
format header. Move them to the incore types header where they belong.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: actually abort log recovery on corrupt intent-done log items

If log recovery picks up intent-done log items that are not of the
correct size it needs to abort recovery and fail the mount. Debug
assertions are not good enough.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: check deferred refcount op continuation parameters

If we're in the middle of a deferred refcount operation and decide to
roll the transaction to avoid overflowing the transaction space, we need
to check the new agbno/aglen parameters that we're about to record in
the new intent. Specifically, we need to check that the new extent is
completely within the filesystem, and that continuation does not put us
into a different AG.

If the keys of a node block are wrong, the lookup to resume an
xfs_refcount_adjust_extents operation can put us into the wrong record
block. If this happens, we might not find that we run out of aglen at
an exact record boundary, which will cause the loop control to do the
wrong thing.

The previous patch should take care of that problem, but let's add this
extra sanity check to stop corruption problems sooner rather than later.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: refactor all the EFI/EFD log item sizeof logic

Refactor all the open-coded sizeof logic for EFI/EFD log item and log
format structures into common helper functions whose names reflect the
struct names.
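
A sketch of one such helper, assuming a format struct that ends in a flex
array. The demo_ struct layouts and the function name are hypothetical
and are not the ondisk definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record. */
struct demo_extent {
	uint64_t start;
	uint32_t len;
};

/* Hypothetical log format struct ending in a flex array. */
struct demo_efi_format {
	uint16_t type;
	uint16_t size;
	uint32_t nextents;
	uint64_t id;
	struct demo_extent extents[];
};

/* Size of a format struct carrying nr extent records, named after the
 * struct it measures. */
size_t demo_efi_format_sizeof(unsigned int nr)
{
	return sizeof(struct demo_efi_format) +
	       (size_t)nr * sizeof(struct demo_extent);
}
```

Keeping the arithmetic in one place means callers no longer repeat the
"head plus nr records" expression by hand.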

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Allison Henderson <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: create a predicate to verify per-AG extents

Create a predicate function to verify that a given agbno/blockcount pair
fit entirely within a single allocation group and don't suffer
mathematical overflows. Refactor the existing open-coded logic; we're
going to add more calls to this function in the next patch.
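
A simplified sketch of such a predicate. It takes the AG size directly
instead of a per-AG structure, the name is hypothetical, and it omits any
lower bound on agbno that the real check may apply.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if [agbno, agbno + len) is non-empty, does not wrap around in
 * 32 bits, and lies entirely inside an AG of agblocks blocks. */
bool demo_verify_agbext(uint32_t agblocks, uint32_t agbno, uint32_t len)
{
	/* Catches both len == 0 and unsigned wraparound of the sum. */
	if (agbno + len <= agbno)
		return false;

	/* First block must be inside the AG. */
	if (agbno >= agblocks)
		return false;

	/* Last block must be inside the AG too. */
	return agbno + len - 1 < agblocks;
}
```

For a 100-block AG, the extent (90, 10) passes because its last block is
99, while (90, 11) and any zero-length extent are rejected.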

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: fix memcpy fortify errors in EFI log format copying

Starting in 6.1, CONFIG_FORTIFY_SOURCE checks the length parameter of
memcpy.  Since we're already fixing problems with BUI item copying, we
should fix everything else.

An extra difficulty here is that the ef[id]_extents arrays are declared
as single-element arrays.  This is not the convention for flex arrays in
the modern kernel, and it causes all manner of problems with static
checking tools, since they often cannot tell the difference between a
single element array and a flex array.

So for starters, change those array[1] declarations to array[]
declarations to signal that they are proper flex arrays and adjust all
the "size-1" expressions to fit the new declaration style.

Next, refactor the xfs_efi_copy_format function to handle the copying of
the head and the flex array members separately.  While we're at it, fix
a minor validation deficiency in the recovery function.
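
The head-then-array copy can be sketched in userspace as follows. The
struct layouts, the demo_ names and the -1 error return are assumptions
for illustration; the sketch is not xfs_efi_copy_format itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical extent record. */
struct demo_extent {
	uint64_t start;
	uint32_t len;
};

/* Hypothetical format struct: declared with extents[], not extents[1]. */
struct demo_fmt {
	uint16_t type;
	uint16_t size;
	uint32_t nextents;	/* record count stored in the head */
	uint64_t id;
	struct demo_extent extents[];
};

#define DEMO_HEAD_LEN	offsetof(struct demo_fmt, extents)

/* Allocate an in-memory item with room for nr extent records. */
struct demo_fmt *demo_fmt_alloc(uint32_t nr)
{
	return calloc(1, DEMO_HEAD_LEN +
			 (size_t)nr * sizeof(struct demo_extent));
}

/* Copy a recovered buffer into dst, which has room for dst_cap records.
 * Returns 0 on success, -1 if the buffer looks corrupt. */
int demo_copy_format(const void *buf, size_t buf_len,
		     struct demo_fmt *dst, uint32_t dst_cap)
{
	const unsigned char *src = buf;
	size_t array_len;

	/* The buffer must at least cover the head, which holds the count. */
	if (buf_len < DEMO_HEAD_LEN)
		return -1;
	memcpy(dst, src, DEMO_HEAD_LEN);

	/* Trust the count only after checking it against both sizes. */
	if (dst->nextents > dst_cap)
		return -1;
	array_len = (size_t)dst->nextents * sizeof(struct demo_extent);
	if (buf_len != DEMO_HEAD_LEN + array_len)
		return -1;

	/* Second copy targets the flex array on its own. */
	memcpy(dst->extents, src + DEMO_HEAD_LEN, array_len);
	return 0;
}
```

Each memcpy() now has a destination whose size is known, the fixed head
in one call and the flex array in the other, which is the property the
fortify checks look for.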

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Reviewed-by: Allison Henderson <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

xfs: make sure aglen never goes negative in xfs_refcount_adjust_extents

Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen
such that the end of the range would not be in the middle of a refcount
record. If this is no longer the case, something is seriously wrong
with the btree. Bail out with a corruption error.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>