Git Repo - linux.git/log

net: ethtool: Fix RSS setting

When user submits a rxfh set command without touching XFRM_SYM_XOR,
rxfh.input_xfrm is set to RXH_XFRM_NO_CHANGE, which is equal to 0xff.

Testing if (rxfh.input_xfrm & RXH_XFRM_SYM_XOR &&
!ops->cap_rss_sym_xor_supported)
return -EOPNOTSUPP;

Will always be true on devices that don't set cap_rss_sym_xor_supported,
since rxfh.input_xfrm & RXH_XFRM_SYM_XOR is always true, if input_xfrm
was not set, i.e RXH_XFRM_NO_CHANGE=0xff, which will result in failure
of any command that doesn't require any change of XFRM, e.g RSS context
or hash function changes.

To avoid this breakage, test if rxfh.input_xfrm != RXH_XFRM_NO_CHANGE
before testing other conditions. Note that the problem will only trigger
with XFRM-aware userspace, old ethtool CLI would continue to work.

Fixes: 0dd415d15505 ("net: ethtool: add a NO_CHANGE uAPI for new RXFH's input_xfrm")
Signed-off-by: Saeed Mahameed <[email protected]>
Reviewed-by: Ahmed Zaki <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'net-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from bpf and netfilter.

  Current release - regressions:

   - core: fix rc7's __skb_datagram_iter() regression

  Current release - new code bugs:

   - eth: bnxt: fix crashes when reducing ring count with active RSS
     contexts

  Previous releases - regressions:

   - sched: fix UAF when resolving a clash

   - skmsg: skip zero length skb in sk_msg_recvmsg2

   - sunrpc: fix kernel free on connection failure in
     xs_tcp_setup_socket

   - tcp: avoid too many retransmit packets

   - tcp: fix incorrect undo caused by DSACK of TLP retransmit

   - udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().

   - eth: ks8851: fix deadlock with the SPI chip variant

   - eth: i40e: fix XDP program unloading while removing the driver

  Previous releases - always broken:

   - bpf:
       - fix too early release of tcx_entry
       - fail bpf_timer_cancel when callback is being cancelled
       - bpf: fix order of args in call to bpf_map_kvcalloc

   - netfilter: nf_tables: prefer nft_chain_validate

   - ppp: reject claimed-as-LCP but actually malformed packets

   - wireguard: avoid unaligned 64-bit memory accesses"

* tag 'net-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits)
  net, sunrpc: Remap EPERM in case of connection failure in xs_tcp_setup_socket
  net/sched: Fix UAF when resolving a clash
  net: ks8851: Fix potential TX stall after interface reopen
  udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().
  netfilter: nf_tables: prefer nft_chain_validate
  netfilter: nfnetlink_queue: drop bogus WARN_ON
  ethtool: netlink: do not return SQI value if link is down
  ppp: reject claimed-as-LCP but actually malformed packets
  selftests/bpf: Add timer lockup selftest
  net: ethernet: mtk-star-emac: set mac_managed_pm when probing
  e1000e: fix force smbus during suspend flow
  tcp: avoid too many retransmit packets
  bpf: Defer work in bpf_timer_cancel_and_free
  bpf: Fail bpf_timer_cancel when callback is being cancelled
  bpf: fix order of args in call to bpf_map_kvcalloc
  net: ethernet: lantiq_etop: fix double free in detach
  i40e: Fix XDP program unloading while removing the driver
  net: fix rc7's __skb_datagram_iter()
  net: ks8851: Fix deadlock with the SPI chip variant
  octeontx2-af: Fix incorrect value output on error path in rvu_check_rsrc_availability()
  ...

Merge tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
"cachefiles:

   - Export an existing and add a new cachefile helper to be used in
     filesystems to fix reference count bugs

   - Use the newly added fscache_ty_get_volume() helper to get a
     reference count on an fscache_volume to handle volumes that are
     about to be removed cleanly

   - After withdrawing a fscache_cache via FSCACHE_CACHE_IS_WITHDRAWN
     wait for all ongoing cookie lookups to complete and for the object
     count to reach zero

   - Propagate errors from vfs_getxattr() to avoid an infinite loop in
     cachefiles_check_volume_xattr() because it keeps seeing ESTALE

   - Don't send new requests when an object is dropped by raising
     CACHEFILES_ONDEMAND_OJBSTATE_DROPPING

   - Cancel all requests for an object that is about to be dropped

   - Wait for the ondemand_boject_worker to finish before dropping a
     cachefiles object to prevent use-after-free

   - Use cyclic allocation for message ids to better handle id recycling

   - Add missing lock protection when iterating through the xarray when
     polling

  netfs:

   - Use standard logging helpers for debug logging

  VFS:

   - Fix potential use-after-free in file locks during
     trace_posix_lock_inode(). The tracepoint could fire while another
     task raced it and freed the lock that was requested to be traced

   - Only increment the nr_dentry_negative counter for dentries that are
     present on the superblock LRU. Currently, DCACHE_LRU_LIST list is
     used to detect this case. However, the flag is also raised in
     combination with DCACHE_SHRINK_LIST to indicate that dentry->d_lru
     is used. So checking only DCACHE_LRU_LIST will lead to wrong
     nr_dentry_negative count. Fix the check to not count dentries that
     are on a shrink related list

  Misc:

   - hfsplus: fix an uninitialized value issue in copy_name

   - minix: fix minixfs_rename with HIGHMEM. It still uses kunmap() even
     though we switched it to kmap_local_page() a while ago"

* tag 'vfs-6.10-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  minixfs: Fix minixfs_rename with HIGHMEM
  hfsplus: fix uninit-value in copy_name
  vfs: don't mod negative dentry count when on shrinker list
  filelock: fix potential use-after-free in posix_lock_inode
  cachefiles: add missing lock protection when polling
  cachefiles: cyclic allocation of msg_id to avoid reuse
  cachefiles: wait for ondemand_object_worker to finish when dropping object
  cachefiles: cancel all requests for the object that is being dropped
  cachefiles: stop sending new request when dropping object
  cachefiles: propagate errors from vfs_getxattr() to avoid infinite loop
  cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
  cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
  netfs, fscache: export fscache_put_volume() and add fscache_try_get_volume()
  netfs: Switch debug logging to pr_debug()

Merge tag 'nf-24-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains Netfilter fixes for net:

Patch #1 fixes a bogus WARN_ON splat in nfnetlink_queue.

Patch #2 fixes a crash due to stack overflow in chain loop detection
by using the existing chain validation routines

Both patches from Florian Westphal.

netfilter pull request 24-07-11

* tag 'nf-24-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: prefer nft_chain_validate
netfilter: nfnetlink_queue: drop bogus WARN_ON
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-07-11

The following pull-request contains BPF updates for your *net* tree.

We've added 4 non-merge commits during the last 2 day(s) which contain
a total of 4 files changed, 262 insertions(+), 19 deletions(-).

The main changes are:

1) Fixes for a BPF timer lockup and a use-after-free scenario when timers
   are used concurrently, from Kumar Kartikeya Dwivedi.

2) Fix the argument order in the call to bpf_map_kvcalloc() which could
   otherwise lead to a compilation error, from Mohammad Shehar Yaar Tausif.

bpf-for-netdev

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Add timer lockup selftest
  bpf: Defer work in bpf_timer_cancel_and_free
  bpf: Fail bpf_timer_cancel when callback is being cancelled
  bpf: fix order of args in call to bpf_map_kvcalloc
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net, sunrpc: Remap EPERM in case of connection failure in xs_tcp_setup_socket

When using a BPF program on kernel_connect(), the call can return -EPERM. This
causes xs_tcp_setup_socket() to loop forever, filling up the syslog and causing
the kernel to potentially freeze up.

Neil suggested:

  This will propagate -EPERM up into other layers which might not be ready
  to handle it. It might be safer to map EPERM to an error we would be more
  likely to expect from the network system - such as ECONNREFUSED or ENETDOWN.

ECONNREFUSED as error seems reasonable. For programs setting a different error
can be out of reach (see handling in 4fbac77d2d09) in particular on kernels
which do not have f10d05966196 ("bpf: Make BPF_PROG_RUN_ARRAY return -err
instead of allow boolean"), thus given that it is better to simply remap for
consistent behavior. UDP does handle EPERM in xs_udp_send_request().

Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect")
Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind")
Co-developed-by: Lex Siegel <[email protected]>
Signed-off-by: Lex Siegel <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Neil Brown <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Anna Schumaker <[email protected]>
Link: https://github.com/cilium/cilium/issues/33395
Link: https://lore.kernel.org/bpf/[email protected]
Link: https://patch.msgid.link/9069ec1d59e4b2129fc23433349fd5580ad43921.1720075070.git.daniel@iogearbox.net
Signed-off-by: Paolo Abeni <[email protected]>

net/sched: Fix UAF when resolving a clash

KASAN reports the following UAF:

BUG: KASAN: slab-use-after-free in tcf_ct_flow_table_process_conn+0x12b/0x380 [act_ct]
Read of size 1 at addr ffff888c07603600 by task handler130/6469

Call Trace:
  <IRQ>
  dump_stack_lvl+0x48/0x70
  print_address_description.constprop.0+0x33/0x3d0
  print_report+0xc0/0x2b0
  kasan_report+0xd0/0x120
  __asan_load1+0x6c/0x80
  tcf_ct_flow_table_process_conn+0x12b/0x380 [act_ct]
  tcf_ct_act+0x886/0x1350 [act_ct]
  tcf_action_exec+0xf8/0x1f0
  fl_classify+0x355/0x360 [cls_flower]
  __tcf_classify+0x1fd/0x330
  tcf_classify+0x21c/0x3c0
  sch_handle_ingress.constprop.0+0x2c5/0x500
  __netif_receive_skb_core.constprop.0+0xb25/0x1510
  __netif_receive_skb_list_core+0x220/0x4c0
  netif_receive_skb_list_internal+0x446/0x620
  napi_complete_done+0x157/0x3d0
  gro_cell_poll+0xcf/0x100
  __napi_poll+0x65/0x310
  net_rx_action+0x30c/0x5c0
  __do_softirq+0x14f/0x491
  __irq_exit_rcu+0x82/0xc0
  irq_exit_rcu+0xe/0x20
  common_interrupt+0xa1/0xb0
  </IRQ>
  <TASK>
  asm_common_interrupt+0x27/0x40

Allocated by task 6469:
  kasan_save_stack+0x38/0x70
  kasan_set_track+0x25/0x40
  kasan_save_alloc_info+0x1e/0x40
  __kasan_krealloc+0x133/0x190
  krealloc+0xaa/0x130
  nf_ct_ext_add+0xed/0x230 [nf_conntrack]
  tcf_ct_act+0x1095/0x1350 [act_ct]
  tcf_action_exec+0xf8/0x1f0
  fl_classify+0x355/0x360 [cls_flower]
  __tcf_classify+0x1fd/0x330
  tcf_classify+0x21c/0x3c0
  sch_handle_ingress.constprop.0+0x2c5/0x500
  __netif_receive_skb_core.constprop.0+0xb25/0x1510
  __netif_receive_skb_list_core+0x220/0x4c0
  netif_receive_skb_list_internal+0x446/0x620
  napi_complete_done+0x157/0x3d0
  gro_cell_poll+0xcf/0x100
  __napi_poll+0x65/0x310
  net_rx_action+0x30c/0x5c0
  __do_softirq+0x14f/0x491

Freed by task 6469:
  kasan_save_stack+0x38/0x70
  kasan_set_track+0x25/0x40
  kasan_save_free_info+0x2b/0x60
  ____kasan_slab_free+0x180/0x1f0
  __kasan_slab_free+0x12/0x30
  slab_free_freelist_hook+0xd2/0x1a0
  __kmem_cache_free+0x1a2/0x2f0
  kfree+0x78/0x120
  nf_conntrack_free+0x74/0x130 [nf_conntrack]
  nf_ct_destroy+0xb2/0x140 [nf_conntrack]
  __nf_ct_resolve_clash+0x529/0x5d0 [nf_conntrack]
  nf_ct_resolve_clash+0xf6/0x490 [nf_conntrack]
  __nf_conntrack_confirm+0x2c6/0x770 [nf_conntrack]
  tcf_ct_act+0x12ad/0x1350 [act_ct]
  tcf_action_exec+0xf8/0x1f0
  fl_classify+0x355/0x360 [cls_flower]
  __tcf_classify+0x1fd/0x330
  tcf_classify+0x21c/0x3c0
  sch_handle_ingress.constprop.0+0x2c5/0x500
  __netif_receive_skb_core.constprop.0+0xb25/0x1510
  __netif_receive_skb_list_core+0x220/0x4c0
  netif_receive_skb_list_internal+0x446/0x620
  napi_complete_done+0x157/0x3d0
  gro_cell_poll+0xcf/0x100
  __napi_poll+0x65/0x310
  net_rx_action+0x30c/0x5c0
  __do_softirq+0x14f/0x491

The ct may be dropped if a clash has been resolved but is still passed to
the tcf_ct_flow_table_process_conn function for further usage. This issue
can be fixed by retrieving ct from skb again after confirming conntrack.

Fixes: 0cc254e5aa37 ("net/sched: act_ct: Offload connections with commit action")
Co-developed-by: Gerald Yang <[email protected]>
Signed-off-by: Gerald Yang <[email protected]>
Signed-off-by: Chengen Du <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net: ks8851: Fix potential TX stall after interface reopen

The amount of TX space in the hardware buffer is tracked in the tx_space
variable. The initial value is currently only set during driver probing.

After closing the interface and reopening it the tx_space variable has
the last value it had before close. If it is smaller than the size of
the first send packet after reopeing the interface the queue will be
stopped. The queue is woken up after receiving a TX interrupt but this
will never happen since we did not send anything.

This commit moves the initialization of the tx_space variable to the
ks8851_net_open function right before starting the TX queue. Also query
the value from the hardware instead of using a hard coded value.

Only the SPI chip variant is affected by this issue because only this
driver variant actually depends on the tx_space variable in the xmit
function.

Fixes: 3dc5d4454545 ("net: ks8851: Fix TX stall caused by TX buffer overrun")
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Simon Horman <[email protected]>
Cc: [email protected]
Cc: [email protected] # 5.10+
Signed-off-by: Ronald Wahl <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

udp: Set SOCK_RCU_FREE earlier in udp_lib_get_port().

syzkaller triggered the warning [0] in udp_v4_early_demux().

In udp_v[46]_early_demux() and sk_lookup(), we do not touch the refcount
of the looked-up sk and use sock_pfree() as skb->destructor, so we check
SOCK_RCU_FREE to ensure that the sk is safe to access during the RCU grace
period.

Currently, SOCK_RCU_FREE is flagged for a bound socket after being put
into the hash table.  Moreover, the SOCK_RCU_FREE check is done too early
in udp_v[46]_early_demux() and sk_lookup(), so there could be a small race
window:

  CPU1                                 CPU2
  ----                                 ----
  udp_v4_early_demux()                 udp_lib_get_port()
  |                                    |- hlist_add_head_rcu()
  |- sk = __udp4_lib_demux_lookup()    |
  |- DEBUG_NET_WARN_ON_ONCE(sk_is_refcounted(sk));
                                       `- sock_set_flag(sk, SOCK_RCU_FREE)

We had the same bug in TCP and fixed it in commit 871019b22d1b ("net:
set SOCK_RCU_FREE before inserting socket into hashtable").

Let's apply the same fix for UDP.

[0]:
WARNING: CPU: 0 PID: 11198 at net/ipv4/udp.c:2599 udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599
Modules linked in:
CPU: 0 PID: 11198 Comm: syz-executor.1 Not tainted 6.9.0-g93bda33046e7 #13
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:udp_v4_early_demux+0x481/0xb70 net/ipv4/udp.c:2599
Code: c5 7a 15 fe bb 01 00 00 00 44 89 e9 31 ff d3 e3 81 e3 bf ef ff ff 89 de e8 2c 74 15 fe 85 db 0f 85 02 06 00 00 e8 9f 7a 15 fe <0f> 0b e8 98 7a 15 fe 49 8d 7e 60 e8 4f 39 2f fe 49 c7 46 60 20 52
RSP: 0018:ffffc9000ce3fa58 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8318c92c
RDX: ffff888036ccde00 RSI: ffffffff8318c2f1 RDI: 0000000000000001
RBP: ffff88805a2dd6e0 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0001ffffffffffff R12: ffff88805a2dd680
R13: 0000000000000007 R14: ffff88800923f900 R15: ffff88805456004e
FS:  00007fc449127640(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fc449126e38 CR3: 000000003de4b002 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
<TASK>
ip_rcv_finish_core.constprop.0+0xbdd/0xd20 net/ipv4/ip_input.c:349
ip_rcv_finish+0xda/0x150 net/ipv4/ip_input.c:447
NF_HOOK include/linux/netfilter.h:314 [inline]
NF_HOOK include/linux/netfilter.h:308 [inline]
ip_rcv+0x16c/0x180 net/ipv4/ip_input.c:569
__netif_receive_skb_one_core+0xb3/0xe0 net/core/dev.c:5624
__netif_receive_skb+0x21/0xd0 net/core/dev.c:5738
netif_receive_skb_internal net/core/dev.c:5824 [inline]
netif_receive_skb+0x271/0x300 net/core/dev.c:5884
tun_rx_batched drivers/net/tun.c:1549 [inline]
tun_get_user+0x24db/0x2c50 drivers/net/tun.c:2002
tun_chr_write_iter+0x107/0x1a0 drivers/net/tun.c:2048
new_sync_write fs/read_write.c:497 [inline]
vfs_write+0x76f/0x8d0 fs/read_write.c:590
ksys_write+0xbf/0x190 fs/read_write.c:643
__do_sys_write fs/read_write.c:655 [inline]
__se_sys_write fs/read_write.c:652 [inline]
__x64_sys_write+0x41/0x50 fs/read_write.c:652
x64_sys_call+0xe66/0x1990 arch/x86/include/generated/asm/syscalls_64.h:2
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0x4b/0x110 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fc44a68bc1f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 e9 cf f5 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 3c d0 f5 ff 48
RSP: 002b:00007fc449126c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00000000004bc050 RCX: 00007fc44a68bc1f
RDX: 0000000000000032 RSI: 00000000200000c0 RDI: 00000000000000c8
RBP: 00000000004bc050 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000032 R11: 0000000000000293 R12: 0000000000000000
R13: 000000000000000b R14: 00007fc44a5ec530 R15: 0000000000000000
</TASK>

Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
Reported-by: syzkaller <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

netfilter: nf_tables: prefer nft_chain_validate

nft_chain_validate already performs loop detection because a cycle will
result in a call stack overflow (ctx->level >= NFT_JUMP_STACK_SIZE).

It also follows maps via ->validate callback in nft_lookup, so there
appears no reason to iterate the maps again.

nf_tables_check_loops() and all its helper functions can be removed.
This improves ruleset load time significantly, from 23s down to 12s.

This also fixes a crash bug. Old loop detection code can result in
unbounded recursion:

BUG: TASK stack guard page was hit at ....
Oops: stack guard page: 0000 [#1] PREEMPT SMP KASAN
CPU: 4 PID: 1539 Comm: nft Not tainted 6.10.0-rc5+ #1
[..]

with a suitable ruleset during validation of register stores.

I can't see any actual reason to attempt to check for this from
nft_validate_register_store(), at this point the transaction is still in
progress, so we don't have a full picture of the rule graph.

For nf-next it might make sense to either remove it or make this depend
on table->validate_state in case we could catch an error earlier
(for improved error reporting to userspace).

Fixes: 20a69341f2d0 ("netfilter: nf_tables: add netlink set API")
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nfnetlink_queue: drop bogus WARN_ON

Happens when rules get flushed/deleted while packet is out, so remove
this WARN_ON.

This WARN exists in one form or another since v4.14, no need to backport
this to older releases, hence use a more recent fixes tag.

Fixes: 3f8019688894 ("netfilter: move nf_reinject into nfnetlink_queue modules")
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

ethtool: netlink: do not return SQI value if link is down

Do not attach SQI value if link is down. "SQI values are only valid if
link-up condition is present" per OpenAlliance specification of
100Base-T1 Interoperability Test suite [1]. The same rule would apply
for other link types.

[1] https://opensig.org/automotive-ethernet-specifications/#

Fixes: 806602191592 ("ethtool: provide UAPI for PHY Signal Quality Index (SQI)")
Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Woojung Huh <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

ppp: reject claimed-as-LCP but actually malformed packets

Since 'ppp_async_encode()' assumes valid LCP packets (with code
from 1 to 7 inclusive), add 'ppp_check_packet()' to ensure that
LCP packet has an actual body beyond PPP_LCP header bytes, and
reject claimed-as-LCP but actually malformed data otherwise.

Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=ec0723ba9605678b14bf
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Dmitry Antipov <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>

selftests/bpf: Add timer lockup selftest

Add a selftest that tries to trigger a situation where two timer callbacks
are attempting to cancel each other's timer. By running them continuously,
we hit a condition where both run in parallel and cancel each other.

Without the fix in the previous patch, this would cause a lockup as
hrtimer_cancel on either side will wait for forward progress from the
callback.

Ensure that this situation leads to a EDEADLK error.

Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

net: ethernet: mtk-star-emac: set mac_managed_pm when probing

The below commit introduced a warning message when phy state is not in
the states: PHY_HALTED, PHY_READY, and PHY_UP.
commit 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state")

mtk-star-emac doesn't need mdiobus suspend/resume. To fix the warning
message during resume, indicate the phy resume/suspend is managed by the
mac when probing.

Fixes: 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state")
Signed-off-by: Jian Hui Lee <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

e1000e: fix force smbus during suspend flow

Commit 861e8086029e ("e1000e: move force SMBUS from enable ulp function
to avoid PHY loss issue") resolved a PHY access loss during suspend on
Meteor Lake consumer platforms, but it affected corporate systems
incorrectly.

A better fix, working for both consumer and corporate systems, was
proposed in commit bfd546a552e1 ("e1000e: move force SMBUS near the end
of enable_ulp function"). However, it introduced a regression on older
devices, such as [8086:15B8], [8086:15F9], [8086:15BE].

This patch aims to fix the secondary regression, by limiting the scope of
the changes to Meteor Lake platforms only.

Fixes: bfd546a552e1 ("e1000e: move force SMBUS near the end of enable_ulp function")
Reported-by: Todd Brandt <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218940
Reported-by: Dieter Mummenschanz <[email protected]>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218936
Signed-off-by: Vitaly Lifshits <[email protected]>
Tested-by: Mor Bar-Gabay <[email protected]> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tcp: avoid too many retransmit packets

If a TCP socket is using TCP_USER_TIMEOUT, and the other peer
retracted its window to zero, tcp_retransmit_timer() can
retransmit a packet every two jiffies (2 ms for HZ=1000),
for about 4 minutes after TCP_USER_TIMEOUT has 'expired'.

The fix is to make sure tcp_rtx_probe0_timed_out() takes
icsk->icsk_user_timeout into account.

Before blamed commit, the socket would not timeout after
icsk->icsk_user_timeout, but would use standard exponential
backoff for the retransmits.

Also worth noting that before commit e89688e3e978 ("net: tcp:
fix unexcepted socket die when snd_wnd is 0"), the issue
would last 2 minutes instead of 4.

Fixes: b701a99e431d ("tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy")
Signed-off-by: Eric Dumazet <[email protected]>
Cc: Neal Cardwell <[email protected]>
Reviewed-by: Jason Xing <[email protected]>
Reviewed-by: Jon Maxwell <[email protected]>
Reviewed-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'fixes-for-bpf-timer-lockup-and-uaf'

Kumar Kartikeya Dwivedi says:

====================
Fixes for BPF timer lockup and UAF

The following patches contain fixes for timer lockups and a
use-after-free scenario.

This set proposes to fix the following lockup situation for BPF timers.

CPU 1 CPU 2

bpf_timer_cb bpf_timer_cb
  timer_cb1   timer_cb2
    bpf_timer_cancel(timer_cb2)     bpf_timer_cancel(timer_cb1)
      hrtimer_cancel       hrtimer_cancel

In this case, both callbacks will continue waiting for each other to
finish synchronously, causing a lockup.

The proposed fix adds support for tracking in-flight cancellations
*begun by other timer callbacks* for a particular BPF timer.  Whenever
preparing to call hrtimer_cancel, a callback will increment the target
timer's counter, then inspect its in-flight cancellations, and if
non-zero, return -EDEADLK to avoid situations where the target timer's
callback is waiting for its completion.

This does mean that in cases where a callback is fired and cancelled, it
will be unable to cancel any timers in that execution. This can be
alleviated by maintaining the list of waiting callbacks in bpf_hrtimer
and searching through it to avoid interdependencies, but this may
introduce additional delays in bpf_timer_cancel, in addition to
requiring extra state at runtime which may need to be allocated or
reused from bpf_hrtimer storage. Moreover, extra synchronization is
needed to delete these elements from the list of waiting callbacks once
hrtimer_cancel has finished.

The second patch is for a deadlock situation similar to above in
bpf_timer_cancel_and_free, but also a UAF scenario that can occur if
timer is armed before entering it, if hrtimer_running check causes the
hrtimer_cancel call to be skipped.

As seen above, synchronous hrtimer_cancel would lead to deadlock (if
same callback tries to free its timer, or two timers free each other),
therefore we queue work onto the global workqueue to ensure outstanding
timers are cancelled before bpf_hrtimer state is freed.

Further details are in the patches.
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Defer work in bpf_timer_cancel_and_free

Currently, the same case as previous patch (two timer callbacks trying
to cancel each other) can be invoked through bpf_map_update_elem as
well, or more precisely, freeing map elements containing timers. Since
this relies on hrtimer_cancel as well, it is prone to the same deadlock
situation as the previous patch.

It would be sufficient to use hrtimer_try_to_cancel to fix this problem,
as the timer cannot be enqueued after async_cancel_and_free. Once
async_cancel_and_free has been done, the timer must be reinitialized
before it can be armed again. The callback running in parallel trying to
arm the timer will fail, and freeing bpf_hrtimer without waiting is
sufficient (given kfree_rcu), and bpf_timer_cb will return
HRTIMER_NORESTART, preventing the timer from being rearmed again.

However, there exists a UAF scenario where the callback arms the timer
before entering this function, such that if cancellation fails (due to
timer callback invoking this routine, or the target timer callback
running concurrently). In such a case, if the timer expiration is
significantly far in the future, the RCU grace period expiration
happening before it will free the bpf_hrtimer state and along with it
the struct hrtimer, that is enqueued.

Hence, it is clear cancellation needs to occur after
async_cancel_and_free, and yet it cannot be done inline due to deadlock
issues. We thus modify bpf_timer_cancel_and_free to defer work to the
global workqueue, adding a work_struct alongside rcu_head (both used at
_different_ points of time, so can share space).

Update existing code comments to reflect the new state of affairs.

Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: Fail bpf_timer_cancel when callback is being cancelled

Given a schedule:

timer1 cb timer2 cb

bpf_timer_cancel(timer2); bpf_timer_cancel(timer1);

Both bpf_timer_cancel calls would wait for the other callback to finish
executing, introducing a lockup.

Add an atomic_t count named 'cancelling' in bpf_hrtimer. This keeps
track of all in-flight cancellation requests for a given BPF timer.
Whenever cancelling a BPF timer, we must check if we have outstanding
cancellation requests, and if so, we must fail the operation with an
error (-EDEADLK) since cancellation is synchronous and waits for the
callback to finish executing. This implies that we can enter a deadlock
situation involving two or more timer callbacks executing in parallel
and attempting to cancel one another.

Note that we avoid incrementing the cancelling counter for the target
timer (the one being cancelled) if bpf_timer_cancel is not invoked from
a callback, to avoid spurious errors. The whole point of detecting
cur->cancelling and returning -EDEADLK is to not enter a busy wait loop
(which may or may not lead to a lockup). This does not apply in case the
caller is in a non-callback context, the other side can continue to
cancel as it sees fit without running into errors.

Background on prior attempts:

Earlier versions of this patch used a bool 'cancelling' bit and used the
following pattern under timer->lock to publish cancellation status.

lock(t->lock);
t->cancelling = true;
mb();
if (cur->cancelling)
return -EDEADLK;
unlock(t->lock);
hrtimer_cancel(t->timer);
t->cancelling = false;

The store outside the critical section could overwrite a parallel
requests t->cancelling assignment to true, to ensure the parallely
executing callback observes its cancellation status.

It would be necessary to clear this cancelling bit once hrtimer_cancel
is done, but lack of serialization introduced races. Another option was
explored where bpf_timer_start would clear the bit when (re)starting the
timer under timer->lock. This would ensure serialized access to the
cancelling bit, but may allow it to be cleared before in-flight
hrtimer_cancel has finished executing, such that lockups can occur
again.

Thus, we choose an atomic counter to keep track of all outstanding
cancellation requests and use it to prevent lockups in case callbacks
attempt to cancel each other while executing in parallel.

Reported-by: Dohyun Kim <[email protected]>
Reported-by: Neel Natu <[email protected]>
Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Signed-off-by: Kumar Kartikeya Dwivedi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

bpf: fix order of args in call to bpf_map_kvcalloc

The original function call passed size of smap->bucket before the number of
buckets which raises the error 'calloc-transposed-args' on compilation.

Vlastimil Babka added:

The order of parameters can be traced back all the way to 6ac99e8f23d4
("bpf: Introduce bpf sk local storage") accross several refactorings,
and that's why the commit is used as a Fixes: tag.

In v6.10-rc1, a different commit 2c321f3f70bc ("mm: change inlined
allocation helpers to account at the call site") however exposed the
order of args in a way that gcc-14 has enough visibility to start
warning about it, because (in !CONFIG_MEMCG case) bpf_map_kvcalloc is
then a macro alias for kvcalloc instead of a static inline wrapper.

To sum up the warning happens when the following conditions are all met:

- gcc-14 is used (didn't see it with gcc-13)
- commit 2c321f3f70bc is present
- CONFIG_MEMCG is not enabled in .config
- CONFIG_WERROR turns this from a compiler warning to error

Fixes: 6ac99e8f23d4 ("bpf: Introduce bpf sk local storage")
Reviewed-by: Andrii Nakryiko <[email protected]>
Tested-by: Christian Kujau <[email protected]>
Signed-off-by: Mohammad Shehar Yaar Tausif <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>

Merge tag 'mm-hotfixes-stable-2024-07-10-13-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"21 hotfixes, 15 of which are cc:stable.

  No identifiable theme here - all are singleton patches, 19 are for MM"

* tag 'mm-hotfixes-stable-2024-07-10-13-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio
  mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()
  filemap: replace pte_offset_map() with pte_offset_map_nolock()
  arch/xtensa: always_inline get_current() and current_thread_info()
  sched.h: always_inline alloc_tag_{save|restore} to fix modpost warnings
  MAINTAINERS: mailmap: update Lorenzo Stoakes's email address
  mm: fix crashes from deferred split racing folio migration
  lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 compat
  mm: gup: stop abusing try_grab_folio
  nilfs2: fix kernel bug on rename operation of broken directory
  mm/hugetlb_vmemmap: fix race with speculative PFN walkers
  cachestat: do not flush stats in recency check
  mm/shmem: disable PMD-sized page cache if needed
  mm/filemap: skip to create PMD-sized page cache if needed
  mm/readahead: limit page cache size in page_cache_ra_order()
  mm/filemap: make MAX_PAGECACHE_ORDER acceptable to xarray
  mm/damon/core: merge regions aggressively when max_nr_regions is unmet
  Fix userfaultfd_api to return EINVAL as expected
  mm: vmalloc: check if a hash-index is in cpu_possible_mask
  mm: prevent derefencing NULL ptr in pfn_section_valid()
  ...

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"One core change that moves a disk start message to a location where it
  will only be printed once instead of twice plus a couple of error
  handling race fixes in the ufs driver"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: Do not repeat the starting disk message
  scsi: ufs: core: Fix ufshcd_abort_one racing issue
  scsi: ufs: core: Fix ufshcd_clear_cmd racing issue

Merge tag 'vfio-v6.10' of https://github.com/awilliam/linux-vfio

Pull VFIO fix from Alex Williamson:

- Recent stable backports are exposing a bug introduced in the v6.10
   development cycle where a counter value is uninitialized.  This leads
   to regressions in userspace drivers like QEMU where where the kernel
   might ask for an arbitrary buffer size or return out of memory itself
   based on a bogus value.  Zero initialize the counter.  (Yi Liu)

* tag 'vfio-v6.10' of https://github.com/awilliam/linux-vfio:
  vfio/pci: Init the count variable in collecting hot-reset devices

Merge tag 'bcachefs-2024-07-10' of https://evilpiepirate.org/git/bcachefs

Pull bcachefs fixes from Kent Overstreet:

- Switch some asserts to WARN()

- Fix a few "transaction not locked" asserts in the data read retry
   paths and backpointers gc

- Fix a race that would cause the journal to get stuck on a flush
   commit

- Add missing fsck checks for the fragmentation LRU

- The usual assorted ssorted syzbot fixes

* tag 'bcachefs-2024-07-10' of https://evilpiepirate.org/git/bcachefs: (22 commits)
  bcachefs: Add missing bch2_trans_begin()
  bcachefs: Fix missing error check in journal_entry_btree_keys_validate()
  bcachefs: Warn on attempting a move with no replicas
  bcachefs: bch2_data_update_to_text()
  bcachefs: Log mount failure error code
  bcachefs: Fix undefined behaviour in eytzinger1_first()
  bcachefs: Mark bch_inode_info as SLAB_ACCOUNT
  bcachefs: Fix bch2_inode_insert() race path for tmpfiles
  closures: fix closure_sync + closure debugging
  bcachefs: Fix journal getting stuck on a flush commit
  bcachefs: io clock: run timer fns under clock lock
  bcachefs: Repair fragmentation_lru in alloc_write_key()
  bcachefs: add check for missing fragmentation in check_alloc_to_lru_ref()
  bcachefs: bch2_btree_write_buffer_maybe_flush()
  bcachefs: Add missing printbuf_tabstops_reset() calls
  bcachefs: Fix loop restart in bch2_btree_transactions_read()
  bcachefs: Fix bch2_read_retry_nodecode()
  bcachefs: Don't use the new_fs() bucket alloc path on an initialized fs
  bcachefs: Fix shift greater than integer size
  bcachefs: Change bch2_fs_journal_stop() BUG_ON() to warning
  ...

Merge tag 'platform-drivers-x86-v6.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fix from Hans de Goede:
"One-liner fix for a dmi_system_id array in the toshiba_acpi driver not
  being terminated properly.

  Something which somehow has escaped detection since being introduced
  in 2022 until now"

* tag 'platform-drivers-x86-v6.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: toshiba_acpi: Fix array out-of-bounds access

Merge tag 'acpi-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
"Fix the sorting of _CST output data in the ACPI processor idle driver
(Kuan-Wei Chiu)"

* tag 'acpi-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: processor_idle: Fix invalid comparison with insertion sort for latency

Merge tag 'pm-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
"Fix two issues related to boost frequencies handling, one in the
  cpufreq core and one in the ACPI cpufreq driver (Mario Limonciello)"

* tag 'pm-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: ACPI: Mark boost policy as enabled when setting boost
  cpufreq: Allow drivers to advertise boost enabled

Merge tag 'thermal-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull thermal control fixes from Rafael Wysocki:
"These fix a possible NULL pointer dereference in a thermal governor,
  fix up the handling of thermal zones enabled before their temperature
  can be determined and fix list sorting during thermal zone temperature
  updates.

  Specifics:

   - Prevent the Power Allocator thermal governor from dereferencing a
     NULL pointer if it is bound to a tripless thermal zone (Nícolas
     Prado)

   - Prevent thermal zones enabled too early from staying effectively
     dormant forever because their temperature cannot be determined
     initially (Rafael Wysocki)

   - Fix list sorting during thermal zone temperature updates to ensure
     the proper ordering of trip crossing notifications (Rafael
     Wysocki)"

* tag 'thermal-6.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  thermal: core: Fix list sorting in __thermal_zone_device_update()
  thermal: core: Call monitor_thermal_zone() if zone temperature is invalid
  thermal: gov_power_allocator: Return early in manage if trip_max is NULL

Merge tag 'devicetree-fixes-for-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree fix from Rob Herring:

- One fix for PASemi Nemo board interrupts

* tag 'devicetree-fixes-for-6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
of/irq: Disable "interrupt-map" parsing for PASEMI Nemo

vfio/pci: Init the count variable in collecting hot-reset devices

The count variable is used without initialization, it results in mistakes
in the device counting and crashes the userspace if the get hot reset info
path is triggered.

Fixes: f6944d4a0b87 ("vfio/pci: Collect hot-reset devices to local buffer")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219010
Reported-by: Žilvinas Žaltiena <[email protected]>
Cc: Beld Zhang <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alex Williamson <[email protected]>

platform/x86: toshiba_acpi: Fix array out-of-bounds access

In order to use toshiba_dmi_quirks[] together with the standard DMI
matching functions, it must be terminated by a empty entry.

Since this entry is missing, an array out-of-bounds access occurs
every time the quirk list is processed.

Fix this by adding the terminating empty entry.

Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Fixes: 3cb1f40dfdc3 ("drivers/platform: toshiba_acpi: Call HCI_PANEL_POWER_ON on resume on some models")
Cc: [email protected]
Signed-off-by: Armin Wolf <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Hans de Goede <[email protected]>
Signed-off-by: Hans de Goede <[email protected]>

bcachefs: Add missing bch2_trans_begin()

this fixes a 'transaction should be locked' error in backpointers fsck

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Fix missing error check in journal_entry_btree_keys_validate()

Closes: https://syzkaller.appspot.com/bug?extid=8996d8f176cf946ef641
Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Warn on attempting a move with no replicas

Instead of popping an assert in bch2_write(), WARN and print out some
debugging info.

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: bch2_data_update_to_text()

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Log mount failure error code

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Fix undefined behaviour in eytzinger1_first()

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Mark bch_inode_info as SLAB_ACCOUNT

After commit 230e9fc28604 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056c9a0 ("kmemcg:
account for certain kmem allocations to memcg")

Signed-off-by: Youling Tang <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Fix bch2_inode_insert() race path for tmpfiles

Signed-off-by: Kent Overstreet <[email protected]>

closures: fix closure_sync + closure debugging

originally, stack closures were only used synchronously, and with the
original implementation of closure_sync() the ref never hit 0; thus,
closure_put_after_sub() assumes that if the ref hits 0 it's on the debug
list, in debug mode.

that's no longer true with the current implementation of closure_sync,
so we need a new magic so closure_debug_destroy() doesn't pop an assert.

Signed-off-by: Kent Overstreet <[email protected]>

bcachefs: Fix journal getting stuck on a flush commit

silly race

Signed-off-by: Kent Overstreet <[email protected]>

minixfs: Fix minixfs_rename with HIGHMEM

minixfs now uses kmap_local_page(), so we can't call kunmap() to
undo it. This one call was missed as part of the commit this fixes.

Fixes: 6628f69ee66a (minixfs: Use dir_put_page() in minix_unlink() and minix_rename())
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>

net: ethernet: lantiq_etop: fix double free in detach

The number of the currently released descriptor is never incremented
which results in the same skb being released multiple times.

Fixes: 504d4721ee8e ("MIPS: Lantiq: Add ethernet driver")
Reported-by: Joe Perches <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Aleksander Jan Bajkowski <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

i40e: Fix XDP program unloading while removing the driver

The commit 6533e558c650 ("i40e: Fix reset path while removing
the driver") introduced a new PF state "__I40E_IN_REMOVE" to block
modifying the XDP program while the driver is being removed.
Unfortunately, such a change is useful only if the ".ndo_bpf()"
callback was called out of the rmmod context because unloading the
existing XDP program is also a part of driver removing procedure.
In other words, from the rmmod context the driver is expected to
unload the XDP program without reporting any errors. Otherwise,
the kernel warning with callstack is printed out to dmesg.

Example failing scenario:
1. Load the i40e driver.
2. Load the XDP program.
3. Unload the i40e driver (using "rmmod" command).

The example kernel warning log:

[  +0.004646] WARNING: CPU: 94 PID: 10395 at net/core/dev.c:9290 unregister_netdevice_many_notify+0x7a9/0x870
[...]
[  +0.010959] RIP: 0010:unregister_netdevice_many_notify+0x7a9/0x870
[...]
[  +0.002726] Call Trace:
[  +0.002457]  <TASK>
[  +0.002119]  ? __warn+0x80/0x120
[  +0.003245]  ? unregister_netdevice_many_notify+0x7a9/0x870
[  +0.005586]  ? report_bug+0x164/0x190
[  +0.003678]  ? handle_bug+0x3c/0x80
[  +0.003503]  ? exc_invalid_op+0x17/0x70
[  +0.003846]  ? asm_exc_invalid_op+0x1a/0x20
[  +0.004200]  ? unregister_netdevice_many_notify+0x7a9/0x870
[  +0.005579]  ? unregister_netdevice_many_notify+0x3cc/0x870
[  +0.005586]  unregister_netdevice_queue+0xf7/0x140
[  +0.004806]  unregister_netdev+0x1c/0x30
[  +0.003933]  i40e_vsi_release+0x87/0x2f0 [i40e]
[  +0.004604]  i40e_remove+0x1a1/0x420 [i40e]
[  +0.004220]  pci_device_remove+0x3f/0xb0
[  +0.003943]  device_release_driver_internal+0x19f/0x200
[  +0.005243]  driver_detach+0x48/0x90
[  +0.003586]  bus_remove_driver+0x6d/0xf0
[  +0.003939]  pci_unregister_driver+0x2e/0xb0
[  +0.004278]  i40e_exit_module+0x10/0x5f0 [i40e]
[  +0.004570]  __do_sys_delete_module.isra.0+0x197/0x310
[  +0.005153]  do_syscall_64+0x85/0x170
[  +0.003684]  ? syscall_exit_to_user_mode+0x69/0x220
[  +0.004886]  ? do_syscall_64+0x95/0x170
[  +0.003851]  ? exc_page_fault+0x7e/0x180
[  +0.003932]  entry_SYSCALL_64_after_hwframe+0x71/0x79
[  +0.005064] RIP: 0033:0x7f59dc9347cb
[  +0.003648] Code: 73 01 c3 48 8b 0d 65 16 0c 00 f7 d8 64 89 01 48 83
c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 35 16 0c 00 f7 d8 64 89 01 48
[  +0.018753] RSP: 002b:00007ffffac99048 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  +0.007577] RAX: ffffffffffffffda RBX: 0000559b9bb2f6e0 RCX: 00007f59dc9347cb
[  +0.007140] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 0000559b9bb2f748
[  +0.007146] RBP: 00007ffffac99070 R08: 1999999999999999 R09: 0000000000000000
[  +0.007133] R10: 00007f59dc9a5ac0 R11: 0000000000000206 R12: 0000000000000000
[  +0.007141] R13: 00007ffffac992d8 R14: 0000559b9bb2f6e0 R15: 0000000000000000
[  +0.007151]  </TASK>
[  +0.002204] ---[ end trace 0000000000000000 ]---

Fix this by checking if the XDP program is being loaded or unloaded.
Then, block only loading a new program while "__I40E_IN_REMOVE" is set.
Also, move testing "__I40E_IN_REMOVE" flag to the beginning of XDP_SETUP
callback to avoid unnecessary operations and checks.

Fixes: 6533e558c650 ("i40e: Fix reset path while removing the driver")
Signed-off-by: Michal Kubiak <[email protected]>
Reviewed-by: Maciej Fijalkowski <[email protected]>
Tested-by: Chandan Kumar Rout <[email protected]> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio

A kernel crash was observed when migrating hugetlb folio:

BUG: kernel NULL pointer dereference, address: 0000000000000008
PGD 0 P4D 0
Oops: Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 3435 Comm: bash Not tainted 6.10.0-rc6-00450-g8578ca01f21f #66
RIP: 0010:__folio_undo_large_rmappable+0x70/0xb0
RSP: 0018:ffffb165c98a7b38 EFLAGS: 00000097
RAX: fffffbbc44528090 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffffa30e000a2800 RSI: 0000000000000246 RDI: ffffa3153ffffcc0
RBP: fffffbbc44528000 R08: 0000000000002371 R09: ffffffffbe4e5868
R10: 0000000000000001 R11: 0000000000000001 R12: ffffa3153ffffcc0
R13: fffffbbc44468000 R14: 0000000000000001 R15: 0000000000000001
FS:  00007f5b3a716740(0000) GS:ffffa3151fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000010959a000 CR4: 00000000000006f0
Call Trace:
<TASK>
__folio_migrate_mapping+0x59e/0x950
__migrate_folio.constprop.0+0x5f/0x120
move_to_new_folio+0xfd/0x250
migrate_pages+0x383/0xd70
soft_offline_page+0x2ab/0x7f0
soft_offline_page_store+0x52/0x90
kernfs_fop_write_iter+0x12c/0x1d0
vfs_write+0x380/0x540
ksys_write+0x64/0xe0
do_syscall_64+0xb9/0x1d0
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5b3a514887
RSP: 002b:00007ffe138fce68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f5b3a514887
RDX: 000000000000000c RSI: 0000556ab809ee10 RDI: 0000000000000001
RBP: 0000556ab809ee10 R08: 00007f5b3a5d1460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007f5b3a61b780 R14: 00007f5b3a617600 R15: 00007f5b3a616a00

It's because hugetlb folio is passed to __folio_undo_large_rmappable()
unexpectedly.  large_rmappable flag is imperceptibly set to hugetlb folio
since commit f6a8dd98a2ce ("hugetlb: convert alloc_buddy_hugetlb_folio to
use a folio").  Then commit be9581ea8c05 ("mm: fix crashes from deferred
split racing folio migration") makes folio_migrate_mapping() call
folio_undo_large_rmappable() triggering the bug.  Fix this issue by
clearing large_rmappable flag for hugetlb folios.  They don't need that
flag set anyway.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: f6a8dd98a2ce ("hugetlb: convert alloc_buddy_hugetlb_folio to use a folio")
Fixes: be9581ea8c05 ("mm: fix crashes from deferred split racing folio migration")
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm/hugetlb: fix potential race in __update_and_free_hugetlb_folio()

There is a potential race between __update_and_free_hugetlb_folio() and
try_memory_failure_hugetlb():

CPU1 CPU2
__update_and_free_hugetlb_folio try_memory_failure_hugetlb
folio_test_hugetlb
  -- It's still hugetlb folio.
  folio_clear_hugetlb_hwpoison
     spin_lock_irq(&hugetlb_lock);
   __get_huge_page_for_hwpoison
    folio_set_hugetlb_hwpoison
  spin_unlock_irq(&hugetlb_lock);
  spin_lock_irq(&hugetlb_lock);
  __folio_clear_hugetlb(folio);
   -- Hugetlb flag is cleared but too late.
  spin_unlock_irq(&hugetlb_lock);

When the above race occurs, raw error page info will be leaked.  Even
worse, raw error pages won't have hwpoisoned flag set and hit
pcplists/buddy.  Fix this issue by deferring
folio_clear_hugetlb_hwpoison() until __folio_clear_hugetlb() is done.  So
all raw error pages will have hwpoisoned flag set.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 32c877191e02 ("hugetlb: do not clear hugetlb dtor until allocating vmemmap")
Signed-off-by: Miaohe Lin <[email protected]>
Acked-by: Muchun Song <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

filemap: replace pte_offset_map() with pte_offset_map_nolock()

The vmf->ptl in filemap_fault_recheck_pte_none() is still set from
handle_pte_fault().  But at the same time, we did a pte_unmap(vmf->pte).
After a pte_unmap(vmf->pte) unmap and rcu_read_unlock(), the page table
may be racily changed and vmf->ptl maybe fails to protect the actual page
table.  Fix this by replacing pte_offset_map() with
pte_offset_map_nolock().

As David said, the PTL pointer might be stale so if we continue to use
it infilemap_fault_recheck_pte_none(), it might trigger UAF.  Also, if
the PTL fails, the issue fixed by commit 58f327f2ce80 ("filemap: avoid
unnecessary major faults in filemap_fault()") might reappear.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 58f327f2ce80 ("filemap: avoid unnecessary major faults in filemap_fault()")
Signed-off-by: ZhangPeng <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nanyong Sun <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Yin Fengwei <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

arch/xtensa: always_inline get_current() and current_thread_info()

Mark get_current() and current_thread_info() functions as always_inline to
fix the following modpost warning:

WARNING: modpost: vmlinux: section mismatch in reference: get_current+0xc (section: .text.unlikely) -> initcall_level_names (section: .init.data)

The warning happens when these functions are called from an __init
function and they don't get inlined (remain in the .text section) while
the value they return points into .init.data section. Assuming
get_current() always returns a valid address, this situation can happen
only during init stage and accessing .init.data from .text section during
that stage should pose no issues.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling")
Signed-off-by: Suren Baghdasaryan <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: kernel test robot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

sched.h: always_inline alloc_tag_{save|restore} to fix modpost warnings

Mark alloc_tag_{save|restore} as always_inline to fix the following
modpost warnings:

WARNING: modpost: vmlinux: section mismatch in reference: alloc_tag_save+0x1c (section: .text.unlikely) -> initcall_level_names (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: alloc_tag_restore+0x3c (section: .text.unlikely) -> initcall_level_names (section: .init.data)

The warnings happen when these functions are called from an __init
function and they don't get inlined (remain in the .text section) while
the value returned by get_current() points into .init.data section.
Assuming get_current() always returns a valid address, this situation can
happen only during init stage and accessing .init.data from .text section
during that stage should pose no issues.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling")
Signed-off-by: Suren Baghdasaryan <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
Cc: Kent Overstreet <[email protected]>
Cc: Chris Zankel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Max Filippov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

net: fix rc7's __skb_datagram_iter()

X would not start in my old 32-bit partition (and the "n"-handling looks
just as wrong on 64-bit, but for whatever reason did not show up there):
"n" must be accumulated over all pages before it's added to "offset" and
compared with "copy", immediately after the skb_frag_foreach_page() loop.

Fixes: d2d30a376d9c ("net: allow skb_datagram_iter to be called from any context")
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Sagi Grimberg <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

- fix access flags to address fuse incompatibility

- fix device type returned by get filesystem info

* tag '6.10-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: discard write access to the directory open
ksmbd: return FILE_DEVICE_DISK instead of super magic

Merge tag 'linux_kselftest-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fixes from Shuah Khan
"Fixes to clang build failures to timerns, vDSO tests and fixes to vDSO
  makefile"

* tag 'linux_kselftest-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/vDSO: remove duplicate compiler invocations from Makefile
  selftests/vDSO: remove partially duplicated "all:" target in Makefile
  selftests/vDSO: fix clang build errors and warnings
  selftest/timerns: fix clang build failures for abs() calls

s390/mm: Add NULL pointer check to crst_table_free() base_crst_free()

crst_table_free() used to work with NULL pointers before the conversion
to ptdescs. Since crst_table_free() can be called with a NULL pointer
(error handling in crst_table_upgrade() add an explicit check.

Also add the same check to base_crst_free() for consistency reasons.

In real life this should not happen, since order two GFP_KERNEL
allocations will not fail, unless FAIL_PAGE_ALLOC is enabled and used.

Reported-by: Yunseong Kim <[email protected]>
Fixes: 6326c26c1514 ("s390: convert various pgalloc functions to use ptdescs")
Signed-off-by: Heiko Carstens <[email protected]>
Acked-by: Alexander Gordeev <[email protected]>
Cc: [email protected]
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-07-09

The following pull-request contains BPF updates for your *net* tree.

We've added 3 non-merge commits during the last 1 day(s) which contain
a total of 5 files changed, 81 insertions(+), 11 deletions(-).

The main changes are:

1) Fix a use-after-free in a corner case where tcx_entry got released too
   early. Also add BPF test coverage along with the fix, from Daniel Borkmann.

2) Fix a kernel panic on Loongarch in sk_msg_recvmsg() which got triggered
   by running BPF sockmap selftests, from Geliang Tang.

bpf-for-netdev

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  skmsg: Skip zero length skb in sk_msg_recvmsg
  selftests/bpf: Extend tcx tests to cover late tcx_entry release
  bpf: Fix too early release of tcx_entry
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

net: ks8851: Fix deadlock with the SPI chip variant

When SMP is enabled and spinlocks are actually functional then there is
a deadlock with the 'statelock' spinlock between ks8851_start_xmit_spi
and ks8851_irq:

    watchdog: BUG: soft lockup - CPU#0 stuck for 27s!
    call trace:
      queued_spin_lock_slowpath+0x100/0x284
      do_raw_spin_lock+0x34/0x44
      ks8851_start_xmit_spi+0x30/0xb8
      ks8851_start_xmit+0x14/0x20
      netdev_start_xmit+0x40/0x6c
      dev_hard_start_xmit+0x6c/0xbc
      sch_direct_xmit+0xa4/0x22c
      __qdisc_run+0x138/0x3fc
      qdisc_run+0x24/0x3c
      net_tx_action+0xf8/0x130
      handle_softirqs+0x1ac/0x1f0
      __do_softirq+0x14/0x20
      ____do_softirq+0x10/0x1c
      call_on_irq_stack+0x3c/0x58
      do_softirq_own_stack+0x1c/0x28
      __irq_exit_rcu+0x54/0x9c
      irq_exit_rcu+0x10/0x1c
      el1_interrupt+0x38/0x50
      el1h_64_irq_handler+0x18/0x24
      el1h_64_irq+0x64/0x68
      __netif_schedule+0x6c/0x80
      netif_tx_wake_queue+0x38/0x48
      ks8851_irq+0xb8/0x2c8
      irq_thread_fn+0x2c/0x74
      irq_thread+0x10c/0x1b0
      kthread+0xc8/0xd8
      ret_from_fork+0x10/0x20

This issue has not been identified earlier because tests were done on
a device with SMP disabled and so spinlocks were actually NOPs.

Now use spin_(un)lock_bh for TX queue related locking to avoid execution
of softirq work synchronously that would lead to a deadlock.

Fixes: 3dc5d4454545 ("net: ks8851: Fix TX stall caused by TX buffer overrun")
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Simon Horman <[email protected]>
Cc: [email protected]
Cc: [email protected] # 5.10+
Signed-off-by: Ronald Wahl <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

octeontx2-af: Fix incorrect value output on error path in rvu_check_rsrc_availability()

In rvu_check_rsrc_availability() in case of invalid SSOW req, an incorrect
data is printed to error log. 'req->sso' value is printed instead of
'req->ssow'. Looks like "copy-paste" mistake.

Fix this mistake by replacing 'req->sso' with 'req->ssow'.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Fixes: 746ea74241fa ("octeontx2-af: Add RVU block LF provisioning support")
Signed-off-by: Aleksandr Mishin <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

bnxt: fix crashes when reducing ring count with active RSS contexts

bnxt doesn't check if a ring is used by RSS contexts when reducing
ring count. Core performs a similar check for the drivers for
the main context, but core doesn't know about additional contexts,
so it can't validate them. bnxt_fill_hw_rss_tbl_p5() uses ring
id to index bp->rx_ring[], which without the check may end up
being out of bounds.

  BUG: KASAN: slab-out-of-bounds in __bnxt_hwrm_vnic_set_rss+0xb79/0xe40
  Read of size 2 at addr ffff8881c5809618 by task ethtool/31525
  Call Trace:
  __bnxt_hwrm_vnic_set_rss+0xb79/0xe40
   bnxt_hwrm_vnic_rss_cfg_p5+0xf7/0x460
   __bnxt_setup_vnic_p5+0x12e/0x270
   __bnxt_open_nic+0x2262/0x2f30
   bnxt_open_nic+0x5d/0xf0
   ethnl_set_channels+0x5d4/0xb30
   ethnl_default_set_doit+0x2f1/0x620

Core does track the additional contexts in net-next, so we can
move this validation out of the driver as a follow up there.

Fixes: b3d0083caf9a ("bnxt_en: Support RSS contexts in ethtool .{get|set}_rxfh()")
Signed-off-by: Jakub Kicinski <[email protected]>
Reviewed-by: Pavan Chebbi <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

skmsg: Skip zero length skb in sk_msg_recvmsg

When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch
platform, the following kernel panic occurs:

  [...]
  Oops[#1]:
  CPU: 22 PID: 2824 Comm: test_progs Tainted: G           OE  6.10.0-rc2+ #18
  Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018
     ... ...
     ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560
    ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0
   CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
   PRMD: 0000000c (PPLV0 +PIE +PWE)
   EUEN: 00000007 (+FPE +SXE +ASXE -BTE)
   ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
  ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0)
   BADV: 0000000000000040
   PRID: 0014c011 (Loongson-64bit, Loongson-3C5000)
  Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack
  Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...)
  Stack : ...
  Call Trace:
  [<9000000004162774>] copy_page_to_iter+0x74/0x1c0
  [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560
  [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0
  [<90000000049aae34>] inet_recvmsg+0x54/0x100
  [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0
  [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0
  [<900000000481e27c>] sys_recvfrom+0x1c/0x40
  [<9000000004c076ec>] do_syscall+0x8c/0xc0
  [<9000000003731da4>] handle_syscall+0xc4/0x160
  Code: ...
  ---[ end trace 0000000000000000 ]---
  Kernel panic - not syncing: Fatal exception
  Kernel relocated by 0x3510000
   .text @ 0x9000000003710000
   .data @ 0x9000000004d70000
   .bss  @ 0x9000000006469400
  ---[ end Kernel panic - not syncing: Fatal exception ]---
  [...]

This crash happens every time when running sockmap_skb_verdict_shutdown
subtest in sockmap_basic.

This crash is because a NULL pointer is passed to page_address() in the
sk_msg_recvmsg(). Due to the different implementations depending on the
architecture, page_address(NULL) will trigger a panic on Loongarch
platform but not on x86 platform. So this bug was hidden on x86 platform
for a while, but now it is exposed on Loongarch platform. The root cause
is that a zero length skb (skb->len == 0) was put on the queue.

This zero length skb is a TCP FIN packet, which was sent by shutdown(),
invoked in test_sockmap_skb_verdict_shutdown():

shutdown(p1, SHUT_WR);

In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no
page is put to this sge (see sg_set_page in sg_set_page), but this empty
sge is queued into ingress_msg list.

And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by
sg_page(sge). Pass this NULL page to copy_page_to_iter(), which passes it
to kmap_local_page() and to page_address(), then kernel panics.

To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(),
if copy is zero, that means it's a zero length skb, skip invoking
copy_page_to_iter(). We are using the EFAULT return triggered by
copy_page_to_iter to check for is_fin in tcp_bpf.c.

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Suggested-by: John Fastabend <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.1719992715.git.tanggeliang@kylinos.cn

net: phy: microchip: lan87xx: reinit PHY after cable test

Reinit PHY after cable test, otherwise link can't be established on
tested port. This issue is reproducible on LAN9372 switches with
integrated 100BaseT1 PHYs.

Fixes: 788050256c411 ("net: phy: microchip_t1: add cable test support for lan87xx phy")
Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Andrew Lunn <[email protected]>
Reviewed-by: Michal Kubiak <[email protected]>
Reviewed-by: Florian Fainelli <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

of/irq: Disable "interrupt-map" parsing for PASEMI Nemo

Once again, we've broken PASEMI Nemo boards with its incomplete
"interrupt-map" translations. Commit 935df1bd40d4 ("of/irq: Factor out
parsing of interrupt-map parent phandle+args from of_irq_parse_raw()")
changed the behavior resulting in the existing work-around not taking
effect. Rework the work-around to just skip parsing "interrupt-map" up
front by using the of_irq_imap_abusers list.

Fixes: 935df1bd40d4 ("of/irq: Factor out parsing of interrupt-map parent phandle+args from of_irq_parse_raw()")
Reported-by: Christian Zigotzky <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Rob Herring (Arm) <[email protected]>

Merge tag 'perf-tools-fixes-for-v6.10-2024-07-08' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools fixes from Namhyung Kim:
"Fix performance issue for v6.10

  These address the performance issues reported by Matt, Namhyung and
  Linus. Recently perf changed the processing of the comm string and DSO
  using sorted arrays but this caused it to sort the array whenever
  adding a new entry.

  This caused a performance issue and the fix is to enhance the sorting
  by finding the insertion point in the sorted array and to shift
  righthand side using memmove()"

* tag 'perf-tools-fixes-for-v6.10-2024-07-08' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
  perf dsos: When adding a dso into sorted dsos maintain the sort order
  perf comm str: Avoid sort during insert

selftests/bpf: Extend tcx tests to cover late tcx_entry release

Add a test case which replaces an active ingress qdisc while keeping the
miniq in-tact during the transition period to the new clsact qdisc.

  # ./vmtest.sh -- ./test_progs -t tc_link
  [...]
  ./test_progs -t tc_link
  [    3.412871] bpf_testmod: loading out-of-tree module taints kernel.
  [    3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #332     tc_links_after:OK
  #333     tc_links_append:OK
  #334     tc_links_basic:OK
  #335     tc_links_before:OK
  #336     tc_links_chain_classic:OK
  #337     tc_links_chain_mixed:OK
  #338     tc_links_dev_chain0:OK
  #339     tc_links_dev_cleanup:OK
  #340     tc_links_dev_mixed:OK
  #341     tc_links_ingress:OK
  #342     tc_links_invalid:OK
  #343     tc_links_prepend:OK
  #344     tc_links_replace:OK
  #345     tc_links_revision:OK
  Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

bpf: Fix too early release of tcx_entry

Pedro Pinto and later independently also Hyunwoo Kim and Wongi Lee reported
an issue that the tcx_entry can be released too early leading to a use
after free (UAF) when an active old-style ingress or clsact qdisc with a
shared tc block is later replaced by another ingress or clsact instance.

Essentially, the sequence to trigger the UAF (one example) can be as follows:

  1. A network namespace is created
  2. An ingress qdisc is created. This allocates a tcx_entry, and
     &tcx_entry->miniq is stored in the qdisc's miniqp->p_miniq. At the
     same time, a tcf block with index 1 is created.
  3. chain0 is attached to the tcf block. chain0 must be connected to
     the block linked to the ingress qdisc to later reach the function
     tcf_chain0_head_change_cb_del() which triggers the UAF.
  4. Create and graft a clsact qdisc. This causes the ingress qdisc
     created in step 1 to be removed, thus freeing the previously linked
     tcx_entry:

     rtnetlink_rcv_msg()
       => tc_modify_qdisc()
         => qdisc_create()
           => clsact_init() [a]
         => qdisc_graft()
           => qdisc_destroy()
             => __qdisc_destroy()
               => ingress_destroy() [b]
                 => tcx_entry_free()
                   => kfree_rcu() // tcx_entry freed

  5. Finally, the network namespace is closed. This registers the
     cleanup_net worker, and during the process of releasing the
     remaining clsact qdisc, it accesses the tcx_entry that was
     already freed in step 4, causing the UAF to occur:

     cleanup_net()
       => ops_exit_list()
         => default_device_exit_batch()
           => unregister_netdevice_many()
             => unregister_netdevice_many_notify()
               => dev_shutdown()
                 => qdisc_put()
                   => clsact_destroy() [c]
                     => tcf_block_put_ext()
                       => tcf_chain0_head_change_cb_del()
                         => tcf_chain_head_change_item()
                           => clsact_chain_head_change()
                             => mini_qdisc_pair_swap() // UAF

There are also other variants, the gist is to add an ingress (or clsact)
qdisc with a specific shared block, then to replace that qdisc, waiting
for the tcx_entry kfree_rcu() to be executed and subsequently accessing
the current active qdisc's miniq one way or another.

The correct fix is to turn the miniq_active boolean into a counter. What
can be observed, at step 2 above, the counter transitions from 0->1, at
step [a] from 1->2 (in order for the miniq object to remain active during
the replacement), then in [b] from 2->1 and finally [c] 1->0 with the
eventual release. The reference counter in general ranges from [0,2] and
it does not need to be atomic since all access to the counter is protected
by the rtnl mutex. With this in place, there is no longer a UAF happening
and the tcx_entry is freed at the correct time.

Fixes: e420bed02507 ("bpf: Add fd-based tcx multi-prog infra with link support")
Reported-by: Pedro Pinto <[email protected]>
Co-developed-by: Pedro Pinto <[email protected]>
Signed-off-by: Pedro Pinto <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Cc: Hyunwoo Kim <[email protected]>
Cc: Wongi Lee <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

thermal: core: Fix list sorting in __thermal_zone_device_update()

The order in which lists are sorted in __thermal_zone_device_update()
is reverse with respect to what it should be due to a mistake in
thermal_trip_notify_cmp().

Fix it and observe that it is not necessary to sort the lists in
different orders. They can both be sorted in ascending order if
way_down_list is walked in reverse order which allows the code to
be slightly more straightforward (and less prone to silly mistakes).

Fixes: 7454f2c42cce ("thermal: core: Sort trip point crossing notifications by temperature")
Signed-off-by: Rafael J. Wysocki <[email protected]>
Link: https://patch.msgid.link/[email protected]

docs: networking: devlink: capitalise length value

Correct the example to match the help text from the devlink utility.

Signed-off-by: Chris Packham <[email protected]>
Reviewed-by: Przemek Kitszel <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

perf dsos: When adding a dso into sorted dsos maintain the sort order

dsos__add would add at the end of the dso array possibly requiring a
later find to re-sort the array. Patterns of find then add were
becoming O(n*log n) due to the sorts. Change the add routine to be
O(n) rather than O(1) but to maintain the sorted-ness of the dsos
array so that later finds don't need the O(n*log n) sort.

Fixes: 3f4ac23a9908 ("perf dsos: Switch backing storage to array from rbtree/list")
Reported-by: Namhyung Kim <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Cc: Steinar Gunderson <[email protected]>
Cc: Athira Rajeev <[email protected]>
Cc: Matt Fleming <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>

perf comm str: Avoid sort during insert

The array is sorted, so just move the elements and insert in order.

Fixes: 13ca628716c6 ("perf comm: Add reference count checking to 'struct comm_str'")
Reported-by: Matt Fleming <[email protected]>
Signed-off-by: Ian Rogers <[email protected]>
Tested-by: Matt Fleming <[email protected]>
Cc: Steinar Gunderson <[email protected]>
Cc: Athira Rajeev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>

Linux 6.10-rc7

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
"A set of clk fixes for the Qualcomm, Mediatek, and Allwinner drivers:

   - Fix the Qualcomm Stromer Plus PLL set_rate() clk_op to explicitly
     set the alpha enable bit and not set bits that don't exist

   - Mark Qualcomm IPQ9574 crypto clks as voted to avoid stuck clk
     warnings

   - Fix the parent of some PLLs on Qualcomm sm6530 so their rate is
     correct

   - Fix the min/max rate clamping logic in the Allwinner driver that
     got broken in v6.9

   - Limit runtime PM enabling in the Mediatek driver to only
     mt8183-mfgcfg so that system wide resume doesn't break on other
     Mediatek SoCs"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: mediatek: mt8183: Only enable runtime PM on mt8183-mfgcfg
  clk: sunxi-ng: common: Don't call hw_to_ccu_common on hw without common
  clk: qcom: gcc-ipq9574: Add BRANCH_HALT_VOTED flag
  clk: qcom: apss-ipq-pll: remove 'config_ctl_hi_val' from Stromer pll configs
  clk: qcom: clk-alpha-pll: set ALPHA_EN bit for Stromer Plus PLLs
  clk: qcom: gcc-sm6350: Fix gpll6* & gpll7 parents

Merge tag 'powerpc-6.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

- Fix unnecessary copy to 0 when kernel is booted at address 0

- Fix usercopy crash when dumping dtl via debugfs

- Avoid possible crash when PCI hotplug races with error handling

- Fix kexec crash caused by scv being disabled before other CPUs
   call-in

- Fix powerpc selftests build with USERCFLAGS set

Thanks to Anjali K, Ganesh Goudar, Gautam Menghani, Jinglin Wen,
Nicholas Piggin, Sourabh Jain, Srikar Dronamraju, and Vishal Chourasia.

* tag 'powerpc-6.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  selftests/powerpc: Fix build with USERCFLAGS set
  powerpc/pseries: Fix scv instruction crash with kexec
  powerpc/eeh: avoid possible crash when edev->pdev changes
  powerpc/pseries: Whitelist dtl slub object for copying to userspace
  powerpc/64s: Fix unnecessary copy to 0 when kernel is booted at address 0

Merge tag '6.10-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fix from Steve French:
"Fix for smb3 readahead performance regression"

* tag '6.10-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Fix read-performance regression by dropping readahead expansion

MAINTAINERS: mailmap: update Lorenzo Stoakes's email address

Now working at Oracle.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Lorenzo Stoakes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: fix crashes from deferred split racing folio migration

Even on 6.10-rc6, I've been seeing elusive "Bad page state"s (often on
flags when freeing, yet the flags shown are not bad: PG_locked had been
set and cleared??), and VM_BUG_ON_PAGE(page_ref_count(page) == 0)s from
deferred_split_scan()'s folio_put(), and a variety of other BUG and WARN
symptoms implying double free by deferred split and large folio migration.

6.7 commit 9bcef5973e31 ("mm: memcg: fix split queue list crash when large
folio migration") was right to fix the memcg-dependent locking broken in
85ce2c517ade ("memcontrol: only transfer the memcg data for migration"),
but missed a subtlety of deferred_split_scan(): it moves folios to its own
local list to work on them without split_queue_lock, during which time
folio->_deferred_list is not empty, but even the "right" lock does nothing
to secure the folio and the list it is on.

Fortunately, deferred_split_scan() is careful to use folio_try_get(): so
folio_migrate_mapping() can avoid the race by folio_undo_large_rmappable()
while the old folio's reference count is temporarily frozen to 0 - adding
such a freeze in the !mapping case too (originally, folio lock and
unmapping and no swap cache left an anon folio unreachable, so no freezing
was needed there: but the deferred split queue offers a way to reach it).

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration")
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Zi Yan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 compat

On a system with Perl 5.12.1, commit 5ef6dc08cfde
("lib/build_OID_registry: don't mention the full path of the script in
output") causes the build to fail with the error below.

     Bareword found where operator expected at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
     syntax error at ./lib/build_OID_registry line 41, near "s#^\Q$abs_srctree/\E##r"
     Execution of ./lib/build_OID_registry aborted due to compilation errors.
     make[3]: *** [lib/Makefile:352: lib/oid_registry_data.c] Error 255

Ahmad Fatoum analyzed that non-destructive substitution is only supported since
Perl 5.13.2. Instead of dropping `r` and having the side effect of modifying
`$0`, introduce a dedicated variable to support older Perl versions.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5ef6dc08cfde ("lib/build_OID_registry: don't mention the full path of the script in output")
Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Paul Menzel <[email protected]>
Suggested-by: Ahmad Fatoum <[email protected]>
Cc: Uwe Kleine-König <[email protected]>
Cc: Nicolas Schier <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Ahmad Fatoum <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>

mm: gup: stop abusing try_grab_folio

A kernel warning was reported when pinning folio in CMA memory when
launching SEV virtual machine.  The splat looks like:

[  464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520
[  464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6
[  464.325477] RIP: 0010:__get_user_pages+0x423/0x520
[  464.325515] Call Trace:
[  464.325520]  <TASK>
[  464.325523]  ? __get_user_pages+0x423/0x520
[  464.325528]  ? __warn+0x81/0x130
[  464.325536]  ? __get_user_pages+0x423/0x520
[  464.325541]  ? report_bug+0x171/0x1a0
[  464.325549]  ? handle_bug+0x3c/0x70
[  464.325554]  ? exc_invalid_op+0x17/0x70
[  464.325558]  ? asm_exc_invalid_op+0x1a/0x20
[  464.325567]  ? __get_user_pages+0x423/0x520
[  464.325575]  __gup_longterm_locked+0x212/0x7a0
[  464.325583]  internal_get_user_pages_fast+0xfb/0x190
[  464.325590]  pin_user_pages_fast+0x47/0x60
[  464.325598]  sev_pin_memory+0xca/0x170 [kvm_amd]
[  464.325616]  sev_mem_enc_register_region+0x81/0x130 [kvm_amd]

Per the analysis done by yangge, when starting the SEV virtual machine, it
will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory.
But the page is in CMA area, so fast GUP will fail then fallback to the
slow path due to the longterm pinnalbe check in try_grab_folio().

The slow path will try to pin the pages then migrate them out of CMA area.
But the slow path also uses try_grab_folio() to pin the page, it will
also fail due to the same check then the above warning is triggered.

In addition, the try_grab_folio() is supposed to be used in fast path and
it elevates folio refcount by using add ref unless zero.  We are guaranteed
to have at least one stable reference in slow path, so the simple atomic add
could be used.  The performance difference should be trivial, but the
misuse may be confusing and misleading.

Redefined try_grab_folio() to try_grab_folio_fast(), and try_grab_page()
to try_grab_folio(), and use them in the proper paths.  This solves both
the abuse and the kernel warning.

The proper naming makes their usecase more clear and should prevent from
abusing in the future.

peterx said:

: The user will see the pin fails, for gpu-slow it further triggers the WARN
: right below that failure (as in the original report):
:
:         folio = try_grab_folio(page, page_increm - 1,
:                                 foll_flags);
:         if (WARN_ON_ONCE(!folio)) { <------------------------ here
:                 /*
:                         * Release the 1st page ref if the
:                         * folio is problematic, fail hard.
:                         */
:                 gup_put_folio(page_folio(page), 1,
:                                 foll_flags);
:                 ret = -EFAULT;
:                 goto out;
:         }

[1] https://lore.kernel.org/linux-mm/1719478388 [email protected]/

[[email protected]: fix implicit declaration of function try_grab_folio_fast]
Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"")
Signed-off-by: Yang Shi <[email protected]>
Reported-by: yangge <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: <[email protected]> [6.6+]
Signed-off-by: Andrew Morton <[email protected]>

Merge tag 'i2c-for-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fix from Wolfram Sang:
"An i2c driver fix"

* tag 'i2c-for-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: pnx: Fix potential deadlock warning from del_timer_sync() call in isr

selftests/powerpc: Fix build with USERCFLAGS set

Currently building the powerpc selftests with USERCFLAGS set to anything
causes the build to break:

  $ make -C tools/testing/selftests/powerpc V=1 USERCFLAGS=-Wno-error
  ...
  gcc -Wno-error    cache_shape.c ...
  cache_shape.c:18:10: fatal error: utils.h: No such file or directory
     18 | #include "utils.h"
        |          ^~~~~~~~~
  compilation terminated.

This happens because the USERCFLAGS are added to CFLAGS in lib.mk, which
causes the check of CFLAGS in powerpc/flags.mk to skip setting CFLAGS at
all, resulting in none of the usual CFLAGS being passed. That can
be seen in the output above, the only flag passed to the compiler is
-Wno-error.

Fix it by dropping the conditional setting of CFLAGS in flags.mk.
Instead always set CFLAGS, but also append USERCFLAGS if they are set.

Note that appending to CFLAGS (with +=) wouldn't work, because flags.mk
is included by multiple Makefiles (to support partial builds), causing
CFLAGS to be appended to multiple times. Additionally that would place
the USERCFLAGS prior to the standard CFLAGS, meaning the USERCFLAGS
couldn't override the standard flags. Being able to override the
standard flags is desirable, for example for adding -Wno-error.

With the fix in place, the CFLAGS are set correctly, including the
USERCFLAGS:

  $ make -C tools/testing/selftests/powerpc V=1 USERCFLAGS=-Wno-error
  ...
  gcc -std=gnu99 -O2 -Wall -Werror -DGIT_VERSION='"v6.10-rc2-7-gdea17e7e56c3"'
  -I/home/michael/linux/tools/testing/selftests/powerpc/include -Wno-error
  cache_shape.c ...

Fixes: 5553a79387e9 ("selftests/powerpc: Add flags.mk to support pmu buildable")
Signed-off-by: Michael Ellerman <[email protected]>
Link: https://msgid.link/[email protected]

hfsplus: fix uninit-value in copy_name

[syzbot reported]
BUG: KMSAN: uninit-value in sized_strscpy+0xc4/0x160
sized_strscpy+0xc4/0x160
copy_name+0x2af/0x320 fs/hfsplus/xattr.c:411
hfsplus_listxattr+0x11e9/0x1a50 fs/hfsplus/xattr.c:750
vfs_listxattr fs/xattr.c:493 [inline]
listxattr+0x1f3/0x6b0 fs/xattr.c:840
path_listxattr fs/xattr.c:864 [inline]
__do_sys_listxattr fs/xattr.c:876 [inline]
__se_sys_listxattr fs/xattr.c:873 [inline]
__x64_sys_listxattr+0x16b/0x2f0 fs/xattr.c:873
x64_sys_call+0x2ba0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:195
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Uninit was created at:
slab_post_alloc_hook mm/slub.c:3877 [inline]
slab_alloc_node mm/slub.c:3918 [inline]
kmalloc_trace+0x57b/0xbe0 mm/slub.c:4065
kmalloc include/linux/slab.h:628 [inline]
hfsplus_listxattr+0x4cc/0x1a50 fs/hfsplus/xattr.c:699
vfs_listxattr fs/xattr.c:493 [inline]
listxattr+0x1f3/0x6b0 fs/xattr.c:840
path_listxattr fs/xattr.c:864 [inline]
__do_sys_listxattr fs/xattr.c:876 [inline]
__se_sys_listxattr fs/xattr.c:873 [inline]
__x64_sys_listxattr+0x16b/0x2f0 fs/xattr.c:873
x64_sys_call+0x2ba0/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:195
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
[Fix]
When allocating memory to strbuf, initialize memory to 0.

Reported-and-tested-by: [email protected]
Signed-off-by: Edward Adam Davis <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reported-and-tested-by: [email protected]
Signed-off-by: Christian Brauner <[email protected]>

tcp: fix incorrect undo caused by DSACK of TLP retransmit

Loss recovery undo_retrans bookkeeping had a long-standing bug where a
DSACK from a spurious TLP retransmit packet could cause an erroneous
undo of a fast recovery or RTO recovery that repaired a single
really-lost packet (in a sequence range outside that of the TLP
retransmit). Basically, because the loss recovery state machine didn't
account for the fact that it sent a TLP retransmit, the DSACK for the
TLP retransmit could erroneously be implicitly be interpreted as
corresponding to the normal fast recovery or RTO recovery retransmit
that plugged a real hole, thus resulting in an improper undo.

For example, consider the following buggy scenario where there is a
real packet loss but the congestion control response is improperly
undone because of this bug:

+ send packets P1, P2, P3, P4
+ P1 is really lost
+ send TLP retransmit of P4
+ receive SACK for original P2, P3, P4
+ enter fast recovery, fast-retransmit P1, increment undo_retrans to 1
+ receive DSACK for TLP P4, decrement undo_retrans to 0, undo (bug!)
+ receive cumulative ACK for P1-P4 (fast retransmit plugged real hole)

The fix: when we initialize undo machinery in tcp_init_undo(), if
there is a TLP retransmit in flight, then increment tp->undo_retrans
so that we make sure that we receive a DSACK corresponding to the TLP
retransmit, as well as DSACKs for all later normal retransmits, before
triggering a loss recovery undo. Note that we also have to move the
line that clears tp->tlp_high_seq for RTO recovery, so that upon RTO
we remember the tp->tlp_high_seq value until tcp_init_undo() and clear
it only afterward.

Also note that the bug dates back to the original 2013 TLP
implementation, commit 6ba8a3b19e76 ("tcp: Tail loss probe (TLP)").

However, this patch will only compile and work correctly with kernels
that have tp->tlp_retrans, which was added only in v5.8 in 2020 in
commit 76be93fc0702 ("tcp: allow at most one TLP probe per flight").
So we associate this fix with that later commit.

Fixes: 76be93fc0702 ("tcp: allow at most one TLP probe per flight")
Signed-off-by: Neal Cardwell <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Cc: Yuchung Cheng <[email protected]>
Cc: Kevin Yang <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge branch 'wireguard-fixes-for-6-10-rc7'

Jason A. Donenfeld says:

====================
wireguard fixes for 6.10-rc7

These are four small fixes for WireGuard, which are all marked for
stable:

1) A QEMU command line fix to remove deprecated flags.

2) Use of proper unaligned helpers to avoid unaligned memory access on
some systems, from Helge.

3) Two patches to annotate intentional data races, so KCSAN and syzbot
don't get upset.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

wireguard: send: annotate intentional data race in checking empty queue

KCSAN reports a race in wg_packet_send_keepalive, which is intentional:

    BUG: KCSAN: data-race in wg_packet_send_keepalive / wg_packet_send_staged_packets

    write to 0xffff88814cd91280 of 8 bytes by task 3194 on cpu 0:
     __skb_queue_head_init include/linux/skbuff.h:2162 [inline]
     skb_queue_splice_init include/linux/skbuff.h:2248 [inline]
     wg_packet_send_staged_packets+0xe5/0xad0 drivers/net/wireguard/send.c:351
     wg_xmit+0x5b8/0x660 drivers/net/wireguard/device.c:218
     __netdev_start_xmit include/linux/netdevice.h:4940 [inline]
     netdev_start_xmit include/linux/netdevice.h:4954 [inline]
     xmit_one net/core/dev.c:3548 [inline]
     dev_hard_start_xmit+0x11b/0x3f0 net/core/dev.c:3564
     __dev_queue_xmit+0xeff/0x1d80 net/core/dev.c:4349
     dev_queue_xmit include/linux/netdevice.h:3134 [inline]
     neigh_connected_output+0x231/0x2a0 net/core/neighbour.c:1592
     neigh_output include/net/neighbour.h:542 [inline]
     ip6_finish_output2+0xa66/0xce0 net/ipv6/ip6_output.c:137
     ip6_finish_output+0x1a5/0x490 net/ipv6/ip6_output.c:222
     NF_HOOK_COND include/linux/netfilter.h:303 [inline]
     ip6_output+0xeb/0x220 net/ipv6/ip6_output.c:243
     dst_output include/net/dst.h:451 [inline]
     NF_HOOK include/linux/netfilter.h:314 [inline]
     ndisc_send_skb+0x4a2/0x670 net/ipv6/ndisc.c:509
     ndisc_send_rs+0x3ab/0x3e0 net/ipv6/ndisc.c:719
     addrconf_dad_completed+0x640/0x8e0 net/ipv6/addrconf.c:4295
     addrconf_dad_work+0x891/0xbc0
     process_one_work kernel/workqueue.c:2633 [inline]
     process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
     worker_thread+0x525/0x730 kernel/workqueue.c:2787
     kthread+0x1d7/0x210 kernel/kthread.c:388
     ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242

    read to 0xffff88814cd91280 of 8 bytes by task 3202 on cpu 1:
     skb_queue_empty include/linux/skbuff.h:1798 [inline]
     wg_packet_send_keepalive+0x20/0x100 drivers/net/wireguard/send.c:225
     wg_receive_handshake_packet drivers/net/wireguard/receive.c:186 [inline]
     wg_packet_handshake_receive_worker+0x445/0x5e0 drivers/net/wireguard/receive.c:213
     process_one_work kernel/workqueue.c:2633 [inline]
     process_scheduled_works+0x5b8/0xa30 kernel/workqueue.c:2706
     worker_thread+0x525/0x730 kernel/workqueue.c:2787
     kthread+0x1d7/0x210 kernel/kthread.c:388
     ret_from_fork+0x48/0x60 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242

    value changed: 0xffff888148fef200 -> 0xffff88814cd91280

Mark this race as intentional by using the skb_queue_empty_lockless()
function rather than skb_queue_empty(), which uses READ_ONCE()
internally to annotate the race.

Cc: [email protected]
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

wireguard: queueing: annotate intentional data race in cpu round robin

KCSAN reports a race in the CPU round robin function, which, as the
comment points out, is intentional:

    BUG: KCSAN: data-race in wg_packet_send_staged_packets / wg_packet_send_staged_packets

    read to 0xffff88811254eb28 of 4 bytes by task 3160 on cpu 1:
     wg_cpumask_next_online drivers/net/wireguard/queueing.h:127 [inline]
     wg_queue_enqueue_per_device_and_peer drivers/net/wireguard/queueing.h:173 [inline]
     wg_packet_create_data drivers/net/wireguard/send.c:320 [inline]
     wg_packet_send_staged_packets+0x60e/0xac0 drivers/net/wireguard/send.c:388
     wg_packet_send_keepalive+0xe2/0x100 drivers/net/wireguard/send.c:239
     wg_receive_handshake_packet drivers/net/wireguard/receive.c:186 [inline]
     wg_packet_handshake_receive_worker+0x449/0x5f0 drivers/net/wireguard/receive.c:213
     process_one_work kernel/workqueue.c:3248 [inline]
     process_scheduled_works+0x483/0x9a0 kernel/workqueue.c:3329
     worker_thread+0x526/0x720 kernel/workqueue.c:3409
     kthread+0x1d1/0x210 kernel/kthread.c:389
     ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

    write to 0xffff88811254eb28 of 4 bytes by task 3158 on cpu 0:
     wg_cpumask_next_online drivers/net/wireguard/queueing.h:130 [inline]
     wg_queue_enqueue_per_device_and_peer drivers/net/wireguard/queueing.h:173 [inline]
     wg_packet_create_data drivers/net/wireguard/send.c:320 [inline]
     wg_packet_send_staged_packets+0x6e5/0xac0 drivers/net/wireguard/send.c:388
     wg_packet_send_keepalive+0xe2/0x100 drivers/net/wireguard/send.c:239
     wg_receive_handshake_packet drivers/net/wireguard/receive.c:186 [inline]
     wg_packet_handshake_receive_worker+0x449/0x5f0 drivers/net/wireguard/receive.c:213
     process_one_work kernel/workqueue.c:3248 [inline]
     process_scheduled_works+0x483/0x9a0 kernel/workqueue.c:3329
     worker_thread+0x526/0x720 kernel/workqueue.c:3409
     kthread+0x1d1/0x210 kernel/kthread.c:389
     ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
     ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

    value changed: 0xffffffff -> 0x00000000

Mark this race as intentional by using READ/WRITE_ONCE().

Cc: [email protected]
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

wireguard: allowedips: avoid unaligned 64-bit memory accesses

On the parisc platform, the kernel issues kernel warnings because
swap_endian() tries to load a 128-bit IPv6 address from an unaligned
memory location:

Kernel: unaligned access to 0x55f4688c in wg_allowedips_insert_v6+0x2c/0x80 [wireguard] (iir 0xf3010df)
Kernel: unaligned access to 0x55f46884 in wg_allowedips_insert_v6+0x38/0x80 [wireguard] (iir 0xf2010dc)

Avoid such unaligned memory accesses by instead using the
get_unaligned_be64() helper macro.

Signed-off-by: Helge Deller <[email protected]>
[Jason: replace src[8] in original patch with src+8]
Cc: [email protected]
Fixes: e7096c131e51 ("net: WireGuard secure network tunnel")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

wireguard: selftests: use acpi=off instead of -no-acpi for recent QEMU

QEMU 9.0 removed -no-acpi, in favor of machine properties, so update the
Makefile to use the correct QEMU invocation.

Cc: [email protected]
Fixes: b83fdcd9fb8a ("wireguard: selftests: use microvm on x86")
Signed-off-by: Jason A. Donenfeld <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net: bcmasp: Fix error code in probe()

Return an error code if bcmasp_interface_create() fails. Don't return
success.

Fixes: 490cb412007d ("net: bcmasp: Add support for ASP2.0 Ethernet controller")
Signed-off-by: Dan Carpenter <[email protected]>
Reviewed-by: Michal Kubiak <[email protected]>
Reviewed-by: Justin Chen <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

Merge tag 'integrity-v6.10-fix' of ssh://ra.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity fix from Mimi Zohar:
"A single bug fix to properly remove all of the securityfs IMA
measurement lists"

* tag 'integrity-v6.10-fix' of ssh://ra.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: fix wrong zero-assignment during securityfs dentry remove

selftests/vDSO: remove duplicate compiler invocations from Makefile

The Makefile open-codes compiler invocations that ../lib.mk already
provides.

Avoid this by using a Make feature that allows setting per-target
variables, which in this case are: CFLAGS and LDFLAGS. This approach
generates the exact same compiler invocations as before, but removes all
of the code duplication, along with the quirky mangled variable names.
So now the Makefile is smaller, less unusual, and easier to read.

The new dependencies are listed after including lib.mk, in order to
let lib.mk provide the first target ("all:"), and are grouped together
with their respective source file dependencies, for visual clarity.

Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

selftests/vDSO: remove partially duplicated "all:" target in Makefile

There were a couple of errors here:

1. TEST_GEN_PROGS was incorrectly prepending $(OUTPUT) to each program
to be built. However, lib.mk already does that because it assumes "bare"
program names are passed in, so this ended up creating
$(OUTPUT)/$(OUTPUT)/file.c, which of course won't work as intended.

2. lib.mk was included before TEST_GEN_PROGS was set, which led to
lib.mk's "all:" target not seeing anything to rebuild.

So nothing worked, which caused the author to force things by creating
an "all:" target locally--while still including ../lib.mk.

Fix all of this by including ../lib.mk at the right place, and removing
the $(OUTPUT) prefix to the programs to be built, and removing the
duplicate "all:" target.

Reviewed-by: Muhammad Usama Anjum <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

selftests/vDSO: fix clang build errors and warnings

When building with clang, via:

    make LLVM=1 -C tools/testing/selftests

...there are several warnings, and an error. This fixes all of those and
allows these tests to run and pass.

1. Fix linker error (undefined reference to memcpy) by providing a local
   version of memcpy.

2. clang complains about using this form:

    if (g = h & 0xf0000000)

...so factor out the assignment into a separate step.

3. The code is passing a signed const char* to elf_hash(), which expects
   a const unsigned char *. There are several callers, so fix this at
   the source by allowing the function to accept a signed argument, and
   then converting to unsigned operations, once inside the function.

4. clang doesn't have __attribute__((externally_visible)) and generates
   a warning to that effect. Fortunately, gcc 12 and gcc 13 do not seem
   to require that attribute in order to build, run and pass tests here,
   so remove it.

Reviewed-by: Carlos Llamas <[email protected]>
Reviewed-by: Edward Liaw <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Tested-by: Muhammad Usama Anjum <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

Merge tag 'pci-v6.10-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci update from Bjorn Helgaas:

- Update MAINTAINERS and CREDITS to credit Gustavo Pimentel with the
   Synopsys DesignWare eDMA driver and reflect that he is no longer at
   Synopsys and isn't in a position to maintain the DesignWare xData
   traffic generator (Bjorn Helgaas)

* tag 'pci-v6.10-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  CREDITS: Add Synopsys DesignWare eDMA driver for Gustavo Pimentel
  MAINTAINERS: Orphan Synopsys DesignWare xData traffic generator

Merge tag 'riscv-for-linus-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:

- A fix for the CMODX example in the recently added icache flushing
   prctl()

- A fix to the perf driver to avoid corrupting event data on counter
   overflows when external overflow handlers are in use

- A fix to clear all hardware performance monitor events on boot, to
   avoid dangling events firmware or previously booted kernels from
   triggering spuriously

- A fix to the perf event probing logic to avoid erroneously reporting
   the presence of unimplemented counters. This also prevents some
   implemented counters from being reported

- A build fix for the vector sigreturn selftest on clang

- A fix to ftrace, which now requires the previously optional index
   argument to ftrace_graph_ret_addr()

- A fix to avoid deadlocking if kexec crash handling triggers in an
   interrupt context

* tag 'riscv-for-linus-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: kexec: Avoid deadlock in kexec crash path
  riscv: stacktrace: fix usage of ftrace_graph_ret_addr()
  riscv: selftests: Fix vsetivli args for clang
  perf: RISC-V: Check standard event availability
  drivers/perf: riscv: Reset the counter to hpmevent mapping while starting cpus
  drivers/perf: riscv: Do not update the event data if uptodate
  documentation: Fix riscv cmodx example

selftest/timerns: fix clang build failures for abs() calls

When building with clang, via:

make LLVM=1 -C tools/testing/selftests

...clang warns about mismatches between the expected and required
integer length being supplied to abs(3).

Fix this by using the correct variant of abs(3): labs(3) or llabs(3), in
these cases.

Reviewed-by: Dmitry Safonov <[email protected]>
Reviewed-by: Muhammad Usama Anjum <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Acked-by: Andrei Vagin <[email protected]>
Signed-off-by: Shuah Khan <[email protected]>

Merge tag 'drm-fixes-2024-07-05' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Daniel Vetter:
"Just small fixes all over here, all quiet as it should.

  drivers:

   - amd: mostly amdgpu display fixes + radeon vm NULL deref fix

   - xe: migration error handling + typoed register name in gt setup

   - i915: usb-c fix to shut up warnings on MTL+

   - panthor: fix sync-only jobs + ioctl validation fix to not EINVAL
     wrongly

   - panel quirks

   - nouveau: NULL deref in get_modes

  drm core:

   - fbdev big endian fix for the dma memory backed variant

  drivers/firmware:

   - fix sysfb refcounting"

* tag 'drm-fixes-2024-07-05' of https://gitlab.freedesktop.org/drm/kernel:
  drm/xe/mcr: Avoid clobbering DSS steering
  drm/xe: fix error handling in xe_migrate_update_pgtables
  drm/ttm: Always take the bo delayed cleanup path for imported bos
  drm/fbdev-generic: Fix framebuffer on big endian devices
  drm/panthor: Fix sync-only jobs
  drm/panthor: Don't check the array stride on empty uobj arrays
  drm/amdgpu/atomfirmware: silence UBSAN warning
  drm/radeon: check bo_va->bo is non-NULL before using it
  drm/amd/display: Fix array-index-out-of-bounds in dml2/FCLKChangeSupport
  drm/amd/display: Update efficiency bandwidth for dcn351
  drm/amd/display: Fix refresh rate range for some panel
  drm/amd/display: Account for cursor prefetch BW in DML1 mode support
  drm/amd/display: Add refresh rate range check
  drm/amd/display: Reset freesync config before update new state
  drm: panel-orientation-quirks: Add labels for both Valve Steam Deck revisions
  drm: panel-orientation-quirks: Add quirk for Valve Galileo
  drm/i915/display: For MTL+ platforms skip mg dp programming
  drm/nouveau: fix null pointer dereference in nouveau_connector_get_modes
  firmware: sysfb: Fix reference count of sysfb parent device

Merge tag 'gpio-fixes-for-v6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux

Pull gpio fixes from Bartosz Golaszewski:
"Two OF lookup quirks and one fix for an issue in the generic gpio-mmio
  driver:

   - add two OF lookup quirks for TSC2005 and MIPS Lantiq

   - don't try to figure out bgpio_bits from the 'ngpios' property in
     gpio-mmio"

* tag 'gpio-fixes-for-v6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  gpiolib: of: add polarity quirk for TSC2005
  gpio: mmio: do not calculate bgpio_bits via "ngpios"
  gpiolib: of: fix lookup quirk for MIPS Lantiq

Merge tag 'tpmdd-next-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd

Pull TPM fixes from Jarkko Sakkinen:
"This contains the fixes for !chip->auth condition, preventing the
  breakage of:
   - tpm_ftpm_tee.c
   - tpm_i2c_nuvoton.c
   - tpm_ibmvtpm.c
   - tpm_tis_i2c_cr50.c
   - tpm_vtpm_proxy.c

  All drivers will continue to work as they did in 6.9, except a single
  warning (dev_warn() not WARN()) is printed to klog only to inform that
  authenticated sessions are not enabled"

* tag 'tpmdd-next-6.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
  tpm: Address !chip->auth in tpm_buf_append_hmac_session*()
  tpm: Address !chip->auth in tpm_buf_append_name()
  tpm: Address !chip->auth in tpm2_*_auth_session()

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fix from Paolo Bonzini:

- s390: fix support for z16 systems

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: s390: fix LPSWEY handling

vfs: don't mod negative dentry count when on shrinker list

The nr_dentry_negative counter is intended to only account negative
dentries that are present on the superblock LRU. Therefore, the LRU
add, remove and isolate helpers modify the counter based on whether
the dentry is negative, but the shrinker list related helpers do not
modify the counter, and the paths that change a dentry between
positive and negative only do so if DCACHE_LRU_LIST is set.

The problem with this is that a dentry on a shrinker list still has
DCACHE_LRU_LIST set to indicate ->d_lru is in use. The additional
DCACHE_SHRINK_LIST flag denotes whether the dentry is on LRU or a
shrink related list. Therefore if a relevant operation (i.e. unlink)
occurs while a dentry is present on a shrinker list, and the
associated codepath only checks for DCACHE_LRU_LIST, then it is
technically possible to modify the negative dentry count for a
dentry that is off the LRU. Since the shrinker list related helpers
do not modify the negative dentry count (because non-LRU dentries
should not be included in the count) when the dentry is ultimately
removed from the shrinker list, this can cause the negative dentry
count to become permanently inaccurate.

This problem can be reproduced via a heavy file create/unlink vs.
drop_caches workload. On an 80xcpu system, I start 80 tasks each
running a 1k file create/delete loop, and one task spinning on
drop_caches. After 10 minutes or so of runtime, the idle/clean cache
negative dentry count increases from somewhere in the range of 5-10
entries to several hundred (and increasingly grows beyond
nr_dentry_unused).

Tweak the logic in the paths that turn a dentry negative or positive
to filter out the case where the dentry is present on a shrink
related list. This allows the above workload to maintain an accurate
negative dentry count.

Fixes: af0c9af1b3f6 ("fs/dcache: Track & report number of negative dentries")
Signed-off-by: Brian Foster <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Acked-by: Ian Kent <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

filelock: fix potential use-after-free in posix_lock_inode

Light Hsieh reported a KASAN UAF warning in trace_posix_lock_inode().
The request pointer had been changed earlier to point to a lock entry
that was added to the inode's list. However, before the tracepoint could
fire, another task raced in and freed that lock.

Fix this by moving the tracepoint inside the spinlock, which should
ensure that this doesn't happen.

Fixes: 74f6f5912693 ("locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock")
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Reported-by: Light Hsieh (謝明燈) <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Alexander Aring <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>

Merge patch series "cachefiles: random bugfixes"

[email protected] <[email protected]> says:

This is the third version of this patch series, in which another patch set
is subsumed into this one to avoid confusing the two patch sets.
(https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)

We've been testing ondemand mode for cachefiles since January, and we're
almost done. We hit a lot of issues during the testing period, and this
patch series fixes some of the issues. The patches have passed internal
testing without regression.

The following is a brief overview of the patches, see the patches for
more details.

Patch 1-2: Add fscache_try_get_volume() helper function to avoid
fscache_volume use-after-free on cache withdrawal.

Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
concurrency causing cachefiles_volume use-after-free.

Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
endless loops.

Patch 5-7: A read request waiting for reopen could be closed maliciously
before the reopen worker is executing or waiting to be scheduled. So
ondemand_object_worker() may be called after the info and object and even
the cache have been freed and trigger use-after-free. So use
cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
reopen worker or wait for it to finish. Since it makes no sense to wait
for the daemon to complete the reopen request, to avoid this pointless
operation blocking cancel_work_sync(), Patch 1 avoids request generation
by the DROPPING state when the request has not been sent, and Patch 2
flushes the requests of the current object before cancel_work_sync().

Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
the daemon to cause hung.

Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing
use-after-free. This issue was triggered frequently in our tests, and we
found that anolis 5.10 had fixed it. So to avoid failing the test, this
patch is pushed upstream as well.

Baokun Li (7):
  netfs, fscache: export fscache_put_volume() and add
    fscache_try_get_volume()
  cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
  cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
  cachefiles: propagate errors from vfs_getxattr() to avoid infinite
    loop
  cachefiles: stop sending new request when dropping object
  cachefiles: cancel all requests for the object that is being dropped
  cachefiles: cyclic allocation of msg_id to avoid reuse

Hou Tao (1):
  cachefiles: wait for ondemand_object_worker to finish when dropping
    object

Jingbo Xu (1):
  cachefiles: add missing lock protection when polling

fs/cachefiles/cache.c          | 45 ++++++++++++++++++++++++++++-
fs/cachefiles/daemon.c         |  4 +--
fs/cachefiles/internal.h       |  3 ++
fs/cachefiles/ondemand.c       | 52 ++++++++++++++++++++++++++++++----
fs/cachefiles/volume.c         |  1 -
fs/cachefiles/xattr.c          |  5 +++-
fs/netfs/fscache_volume.c      | 14 +++++++++
fs/netfs/internal.h            |  2 --
include/linux/fscache-cache.h  |  6 ++++
include/trace/events/fscache.h |  4 +++
10 files changed, 123 insertions(+), 13 deletions(-)

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Christian Brauner <[email protected]>