Git Repo - linux.git/log

virtio_net: use control_buf for coalesce params

Commit 699b045a8e43 ("net: virtio_net: notifications coalescing
support") added coalescing command support for virtio_net. However,
the coalesce commands are using buffers on the stack, which is causing
the device to see DMA errors. There should also be a complaint from
check_for_stack() in debug_dma_map_xyz(). Fix this by adding and using
coalesce params from the control_buf struct, which aligns with other
commands.

Cc: [email protected]
Fixes: 699b045a8e43 ("net: virtio_net: notifications coalescing support")
Reviewed-by: Shannon Nelson <[email protected]>
Signed-off-by: Allen Hubbe <[email protected]>
Signed-off-by: Brett Creeley <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Xuan Zhuo <[email protected]>
Acked-by: Michael S. Tsirkin <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

pds_core: Fix FW recovery detection

Commit 523847df1b37 ("pds_core: add devcmd device interfaces") included
initial support for FW recovery detection. Unfortunately, the ordering
in pdsc_is_fw_good() was incorrect, which was causing FW recovery to be
undetected by the driver. Fix this by making sure to update the cached
fw_status by calling pdsc_is_fw_running() before setting the local FW
gen.

Fixes: 523847df1b37 ("pds_core: add devcmd device interfaces")
Signed-off-by: Shannon Nelson <[email protected]>
Signed-off-by: Brett Creeley <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

tcp: gso: really support BIG TCP

We missed that tcp_gso_segment() was assuming skb->len was smaller than 65535 :

oldlen = (u16)~skb->len;

This part came with commit 0718bcc09b35 ("[NET]: Fix CHECKSUM_HW GSO problems.")

This leads to wrong TCP checksum.

Adapt the code to accept arbitrary packet length.

v2:
- use two csum_add() instead of csum_fold() (Alexander Duyck)
- Change delta type to __wsum to reduce casts (Alexander Duyck)

Fixes: 09f3d1a3a52c ("ipv6/gso: remove temporary HBH/jumbo header")
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Alexander Duyck <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

ipv6: rpl: Fix Route of Death.

A remote DoS vulnerability of RPL Source Routing is assigned CVE-2023-2156.

The Source Routing Header (SRH) has the following format:

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |  Next Header  |  Hdr Ext Len  | Routing Type  | Segments Left |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  | CmprI | CmprE |  Pad  |               Reserved                |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |                                                               |
  .                                                               .
  .                        Addresses[1..n]                        .
  .                                                               .
  |                                                               |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The originator of an SRH places the first hop's IPv6 address in the IPv6
header's IPv6 Destination Address and the second hop's IPv6 address as
the first address in Addresses[1..n].

The CmprI and CmprE fields indicate the number of prefix octets that are
shared with the IPv6 Destination Address.  When CmprI or CmprE is not 0,
Addresses[1..n] are compressed as follows:

  1..n-1 : (16 - CmprI) bytes
       n : (16 - CmprE) bytes

Segments Left indicates the number of route segments remaining.  When the
value is not zero, the SRH is forwarded to the next hop.  Its address
is extracted from Addresses[n - Segment Left + 1] and swapped with IPv6
Destination Address.

When Segment Left is greater than or equal to 2, the size of SRH is not
changed because Addresses[1..n-1] are decompressed and recompressed with
CmprI.

OTOH, when Segment Left changes from 1 to 0, the new SRH could have a
different size because Addresses[1..n-1] are decompressed with CmprI and
recompressed with CmprE.

Let's say CmprI is 15 and CmprE is 0.  When we receive SRH with Segment
Left >= 2, Addresses[1..n-1] have 1 byte for each, and Addresses[n] has
16 bytes.  When Segment Left is 1, Addresses[1..n-1] is decompressed to
16 bytes and not recompressed.  Finally, the new SRH will need more room
in the header, and the size is (16 - 1) * (n - 1) bytes.

Here the max value of n is 255 as Segment Left is u8, so in the worst case,
we have to allocate 3825 bytes in the skb headroom.  However, now we only
allocate a small fixed buffer that is IPV6_RPL_SRH_WORST_SWAP_SIZE (16 + 7
bytes).  If the decompressed size overflows the room, skb_push() hits BUG()
below [0].

Instead of allocating the fixed buffer for every packet, let's allocate
enough headroom only when we receive SRH with Segment Left 1.

[0]:
skbuff: skb_under_panic: text:ffffffff81c9f6e2 len:576 put:576 head:ffff8880070b5180 data:ffff8880070b4fb0 tail:0x70 end:0x140 dev:lo
kernel BUG at net/core/skbuff.c:200!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 154 Comm: python3 Not tainted 6.4.0-rc4-00190-gc308e9ec0047 #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:skb_panic (net/core/skbuff.c:200)
Code: 4f 70 50 8b 87 bc 00 00 00 50 8b 87 b8 00 00 00 50 ff b7 c8 00 00 00 4c 8b 8f c0 00 00 00 48 c7 c7 80 6e 77 82 e8 ad 8b 60 ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffc90000003da0 EFLAGS: 00000246
RAX: 0000000000000085 RBX: ffff8880058a6600 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88807dc1c540 RDI: ffff88807dc1c540
RBP: ffffc90000003e48 R08: ffffffff82b392c8 R09: 00000000ffffdfff
R10: ffffffff82a592e0 R11: ffffffff82b092e0 R12: ffff888005b1c800
R13: ffff8880070b51b8 R14: ffff888005b1ca18 R15: ffff8880070b5190
FS:  00007f4539f0b740(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055670baf3000 CR3: 0000000005b0e000 CR4: 00000000007506f0
PKRU: 55555554
Call Trace:
<IRQ>
skb_push (net/core/skbuff.c:210)
ipv6_rthdr_rcv (./include/linux/skbuff.h:2880 net/ipv6/exthdrs.c:634 net/ipv6/exthdrs.c:718)
ip6_protocol_deliver_rcu (net/ipv6/ip6_input.c:437 (discriminator 5))
ip6_input_finish (./include/linux/rcupdate.h:805 net/ipv6/ip6_input.c:483)
__netif_receive_skb_one_core (net/core/dev.c:5494)
process_backlog (./include/linux/rcupdate.h:805 net/core/dev.c:5934)
__napi_poll (net/core/dev.c:6496)
net_rx_action (net/core/dev.c:6565 net/core/dev.c:6696)
__do_softirq (./arch/x86/include/asm/jump_label.h:27 ./include/linux/jump_label.h:207 ./include/trace/events/irq.h:142 kernel/softirq.c:572)
do_softirq (kernel/softirq.c:472 kernel/softirq.c:459)
</IRQ>
<TASK>
__local_bh_enable_ip (kernel/softirq.c:396)
__dev_queue_xmit (net/core/dev.c:4272)
ip6_finish_output2 (./include/net/neighbour.h:544 net/ipv6/ip6_output.c:134)
rawv6_sendmsg (./include/net/dst.h:458 ./include/linux/netfilter.h:303 net/ipv6/raw.c:656 net/ipv6/raw.c:914)
sock_sendmsg (net/socket.c:724 net/socket.c:747)
__sys_sendto (net/socket.c:2144)
__x64_sys_sendto (net/socket.c:2156 net/socket.c:2152 net/socket.c:2152)
do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
RIP: 0033:0x7f453a138aea
Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
RSP: 002b:00007ffcc212a1c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007ffcc212a288 RCX: 00007f453a138aea
RDX: 0000000000000060 RSI: 00007f4539084c20 RDI: 0000000000000003
RBP: 00007f4538308e80 R08: 00007ffcc212a300 R09: 000000000000001c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: ffffffffc4653600 R14: 0000000000000001 R15: 00007f4539712d1b
</TASK>
Modules linked in:

Fixes: 8610c7c6e3bd ("net: ipv6: add support for rpl sr exthdr")
Reported-by: Max VA
Closes: https://www.interruptlabs.co.uk/articles/linux-ipv6-route-of-death
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

netlink: specs: ethtool: fix random typos

Working on the code gen for C reveals typos in the ethtool spec
as the compiler tries to find the names in the existing uAPI
header. Fix the mistakes.

Fixes: a353318ebf24 ("tools: ynl: populate most of the ethtool spec")
Acked-by: Stanislav Fomichev <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

netfilter: nf_tables: out-of-bound check in chain blob

Add current size of rule expressions to the boundary check.

Fixes: 2c865a8a28a1 ("netfilter: nf_tables: add rule blob layout")
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: ipset: Add schedule point in call_ad().

syzkaller found a repro that causes Hung Task [0] with ipset.  The repro
first creates an ipset and then tries to delete a large number of IPs
from the ipset concurrently:

  IPSET_ATTR_IPADDR_IPV4 : 172.20.20.187
  IPSET_ATTR_CIDR        : 2

The first deleting thread hogs a CPU with nfnl_lock(NFNL_SUBSYS_IPSET)
held, and other threads wait for it to be released.

Previously, the same issue existed in set->variant->uadt() that could run
so long under ip_set_lock(set).  Commit 5e29dc36bd5e ("netfilter: ipset:
Rework long task execution when adding/deleting entries") tried to fix it,
but the issue still exists in the caller with another mutex.

While adding/deleting many IPs, we should release the CPU periodically to
prevent someone from abusing ipset to hang the system.

Note we need to increment the ipset's refcnt to prevent the ipset from
being destroyed while rescheduling.

[0]:
INFO: task syz-executor174:268 blocked for more than 143 seconds.
      Not tainted 6.4.0-rc1-00145-gba79e9a73284 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz-executor174 state:D stack:0     pid:268   ppid:260    flags:0x0000000d
Call trace:
__switch_to+0x308/0x714 arch/arm64/kernel/process.c:556
context_switch kernel/sched/core.c:5343 [inline]
__schedule+0xd84/0x1648 kernel/sched/core.c:6669
schedule+0xf0/0x214 kernel/sched/core.c:6745
schedule_preempt_disabled+0x58/0xf0 kernel/sched/core.c:6804
__mutex_lock_common kernel/locking/mutex.c:679 [inline]
__mutex_lock+0x6fc/0xdb0 kernel/locking/mutex.c:747
__mutex_lock_slowpath+0x14/0x20 kernel/locking/mutex.c:1035
mutex_lock+0x98/0xf0 kernel/locking/mutex.c:286
nfnl_lock net/netfilter/nfnetlink.c:98 [inline]
nfnetlink_rcv_msg+0x480/0x70c net/netfilter/nfnetlink.c:295
netlink_rcv_skb+0x1c0/0x350 net/netlink/af_netlink.c:2546
nfnetlink_rcv+0x18c/0x199c net/netfilter/nfnetlink.c:658
netlink_unicast_kernel net/netlink/af_netlink.c:1339 [inline]
netlink_unicast+0x664/0x8cc net/netlink/af_netlink.c:1365
netlink_sendmsg+0x6d0/0xa4c net/netlink/af_netlink.c:1913
sock_sendmsg_nosec net/socket.c:724 [inline]
sock_sendmsg net/socket.c:747 [inline]
____sys_sendmsg+0x4b8/0x810 net/socket.c:2503
___sys_sendmsg net/socket.c:2557 [inline]
__sys_sendmsg+0x1f8/0x2a4 net/socket.c:2586
__do_sys_sendmsg net/socket.c:2595 [inline]
__se_sys_sendmsg net/socket.c:2593 [inline]
__arm64_sys_sendmsg+0x80/0x94 net/socket.c:2593
__invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
invoke_syscall+0x84/0x270 arch/arm64/kernel/syscall.c:52
el0_svc_common+0x134/0x24c arch/arm64/kernel/syscall.c:142
do_el0_svc+0x64/0x198 arch/arm64/kernel/syscall.c:193
el0_svc+0x2c/0x7c arch/arm64/kernel/entry-common.c:637
el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:655
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591

Reported-by: syzkaller <[email protected]>
Fixes: a7b4f989a629 ("netfilter: ipset: IP set core support")
Signed-off-by: Kuniyuki Iwashima <[email protected]>
Acked-by: Jozsef Kadlecsik <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: conntrack: fix NULL pointer dereference in nf_confirm_cthelper

An nf_conntrack_helper from nf_conn_help may become NULL after DNAT.

Observed when TCP port 1720 (Q931_PORT), associated with h323 conntrack
helper, is DNAT'ed to another destination port (e.g. 1730), while
nfqueue is being used for final acceptance (e.g. snort).

This happenned after transition from kernel 4.14 to 5.10.161.

Workarounds:
* keep the same port (1720) in DNAT
* disable nfqueue
* disable/unload h323 NAT helper

$ linux-5.10/scripts/decode_stacktrace.sh vmlinux < /tmp/kernel.log
BUG: kernel NULL pointer dereference, address: 0000000000000084
[..]
RIP: 0010:nf_conntrack_update (net/netfilter/nf_conntrack_core.c:2080 net/netfilter/nf_conntrack_core.c:2134) nf_conntrack
[..]
nfqnl_reinject (net/netfilter/nfnetlink_queue.c:237) nfnetlink_queue
nfqnl_recv_verdict (net/netfilter/nfnetlink_queue.c:1230) nfnetlink_queue
nfnetlink_rcv_msg (net/netfilter/nfnetlink.c:241) nfnetlink
[..]

Fixes: ee04805ff54a ("netfilter: conntrack: make conntrack userspace helpers work again")
Signed-off-by: Tijs Van Buggenhout <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nft_bitwise: fix register tracking

At the end of `nft_bitwise_reduce`, there is a loop which is intended to
update the bitwise expression associated with each tracked destination
register. However, currently, it just updates the first register
repeatedly. Fix it.

Fixes: 34cc9e52884a ("netfilter: nf_tables: cancel tracking for clobbered destination registers")
Signed-off-by: Jeremy Sowden <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

netfilter: nf_tables: Add null check for nla_nest_start_noflag() in nft_dump_basechain_hook()

The nla_nest_start_noflag() function may fail and return NULL;
the return value needs to be checked.

Found by InfoTeCS on behalf of Linux Verification Center
(linuxtesting.org) with SVACE.

Fixes: d54725cd11a5 ("netfilter: nf_tables: support for multiple devices per netdev hook")
Signed-off-by: Gavrilov Ilia <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

selftests/bpf: Fix sockopt_sk selftest

Commit f4e4534850a9 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
fixed NETLINK_LIST_MEMBERSHIPS length report which caused
selftest sockopt_sk failure. The failure log looks like

  test_sockopt_sk:PASS:join_cgroup /sockopt_sk 0 nsec
  run_test:PASS:skel_load 0 nsec
  run_test:PASS:setsockopt_link 0 nsec
  run_test:PASS:getsockopt_link 0 nsec
  getsetsockopt:FAIL:Unexpected NETLINK_LIST_MEMBERSHIPS value unexpected Unexpected NETLINK_LIST_MEMBERSHIPS value: actual 8 != expected 4
  run_test:PASS:getsetsockopt 0 nsec
  #201     sockopt_sk:FAIL

In net/netlink/af_netlink.c, function netlink_getsockopt(), for NETLINK_LIST_MEMBERSHIPS,
nlk->ngroups equals to 36. Before Commit f4e4534850a9, the optlen is calculated as
  ALIGN(nlk->ngroups / 8, sizeof(u32)) = 4
After that commit, the optlen is
  ALIGN(BITS_TO_BYTES(nlk->ngroups), sizeof(u32)) = 8

Fix the test by setting the expected optlen to be 8.

Fixes: f4e4534850a9 ("net/netlink: fix NETLINK_LIST_MEMBERSHIPS length report")
Signed-off-by: Yonghong Song <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

Merge tag 'spi-fix-v6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi

Pull spi fixes from Mark Brown:
"A small collection of driver specific fixes, none of them particularly
  remarkable or severe"

* tag 'spi-fix-v6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: qup: Request DMA before enabling clocks
  spi: mt65xx: make sure operations completed before unloading
  spi: lpspi: disable lpspi module irq in DMA mode

wifi: cfg80211: fix locking in regulatory disconnect

This should use wiphy_lock() now instead of requiring the
RTNL, since __cfg80211_leave() via cfg80211_leave() is now
requiring that lock to be held.

Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
Signed-off-by: Johannes Berg <[email protected]>

wifi: cfg80211: fix locking in sched scan stop work

This should use wiphy_lock() now instead of acquiring the
RTNL, since cfg80211_stop_sched_scan_req() now needs that.

Fixes: a05829a7222e ("cfg80211: avoid holding the RTNL when calling the driver")
Signed-off-by: Johannes Berg <[email protected]>

Merge tag 'gfs2-v6.4-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:

- Don't get stuck writing page onto itself under direct I/O

* tag 'gfs2-v6.4-rc4-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Don't get stuck writing page onto itself under direct I/O

Merge tag 'platform-drivers-x86-v6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Hans de Goede:

- various Microsoft Surface support fixes

- one fix for the INT3472 driver

* tag 'platform-drivers-x86-v6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: int3472: Avoid crash in unregistering regulator gpio
  platform/surface: aggregator_tabletsw: Add support for book mode in POS subsystem
  platform/surface: aggregator_tabletsw: Add support for book mode in KIP subsystem
  platform/surface: aggregator: Allow completion work-items to be executed in parallel
  platform/surface: aggregator: Make to_ssam_device_driver() respect constness

qed/qede: Fix scheduling while atomic

Statistics read through bond interface via sysfs causes
below bug and traces as it triggers the bonding module to
collect the slave device statistics while holding the spinlock,
beneath that qede->qed driver statistics flow gets scheduled out
due to usleep_range() used in PTT acquire logic

[ 3673.988874] Hardware name: HPE ProLiant DL365 Gen10 Plus/ProLiant DL365 Gen10 Plus, BIOS A42 10/29/2021
[ 3673.988878] Call Trace:
[ 3673.988891]  dump_stack_lvl+0x34/0x44
[ 3673.988908]  __schedule_bug.cold+0x47/0x53
[ 3673.988918]  __schedule+0x3fb/0x560
[ 3673.988929]  schedule+0x43/0xb0
[ 3673.988932]  schedule_hrtimeout_range_clock+0xbf/0x1b0
[ 3673.988937]  ? __hrtimer_init+0xc0/0xc0
[ 3673.988950]  usleep_range+0x5e/0x80
[ 3673.988955]  qed_ptt_acquire+0x2b/0xd0 [qed]
[ 3673.988981]  _qed_get_vport_stats+0x141/0x240 [qed]
[ 3673.989001]  qed_get_vport_stats+0x18/0x80 [qed]
[ 3673.989016]  qede_fill_by_demand_stats+0x37/0x400 [qede]
[ 3673.989028]  qede_get_stats64+0x19/0xe0 [qede]
[ 3673.989034]  dev_get_stats+0x5c/0xc0
[ 3673.989045]  netstat_show.constprop.0+0x52/0xb0
[ 3673.989055]  dev_attr_show+0x19/0x40
[ 3673.989065]  sysfs_kf_seq_show+0x9b/0xf0
[ 3673.989076]  seq_read_iter+0x120/0x4b0
[ 3673.989087]  new_sync_read+0x118/0x1a0
[ 3673.989095]  vfs_read+0xf3/0x180
[ 3673.989099]  ksys_read+0x5f/0xe0
[ 3673.989102]  do_syscall_64+0x3b/0x90
[ 3673.989109]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3673.989115] RIP: 0033:0x7f8467d0b082
[ 3673.989119] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ca 05 08 00 e8 35 e7 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 3673.989121] RSP: 002b:00007ffffb21fd08 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 3673.989127] RAX: ffffffffffffffda RBX: 000000000100eca0 RCX: 00007f8467d0b082
[ 3673.989128] RDX: 00000000000003ff RSI: 00007ffffb21fdc0 RDI: 0000000000000003
[ 3673.989130] RBP: 00007f8467b96028 R08: 0000000000000010 R09: 00007ffffb21ec00
[ 3673.989132] R10: 00007ffffb27b170 R11: 0000000000000246 R12: 00000000000000f0
[ 3673.989134] R13: 0000000000000003 R14: 00007f8467b92000 R15: 0000000000045a05
[ 3673.989139] CPU: 30 PID: 285188 Comm: read_all Kdump: loaded Tainted: G        W  OE

Fix this by collecting the statistics asynchronously from a periodic
delayed work scheduled at default stats coalescing interval and return
the recent copy of statisitcs from .ndo_get_stats64(), also add ability
to configure/retrieve stats coalescing interval using below commands -

ethtool -C ethx stats-block-usecs <val>
ethtool -c ethx

Fixes: 133fac0eedc3 ("qede: Add basic ethtool support")
Cc: Sudarsana Kalluru <[email protected]>
Cc: David Miller <[email protected]>
Signed-off-by: Manish Chopra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>

Merge tag 'for-linus-2023060501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid

Pull HID fix from Jiri Kosina:

- Final, confirmed fix for regression causing some devices connected
   via Logitech HID++ Unifying receiver take too long to initialize
   (Benjamin Tissoires)

* tag 'for-linus-2023060501' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: hidpp: terminate retry loop on success

wifi: iwlwifi: mvm: Fix -Warray-bounds bug in iwl_mvm_wait_d3_notif()

kmemdup() at line 2735 is not duplicating enough memory for
notif->tid_tear_down and notif->station_id. As it only duplicates
612 bytes: up to offsetofend(struct iwl_wowlan_info_notif,
received_beacons), this is the range of [0, 612) bytes.

2735 notif = kmemdup(notif_v1,
2736 offsetofend(struct iwl_wowlan_info_notif,
2737     received_beacons),
2738 GFP_ATOMIC);

which evidently does not cover bytes 612 and 613 for members
tid_tear_down and station_id in struct iwl_wowlan_info_notif.
See below:

$ pahole -C iwl_wowlan_info_notif drivers/net/wireless/intel/iwlwifi/mvm/d3.o
struct iwl_wowlan_info_notif {
struct iwl_wowlan_gtk_status_v3 gtk[2];          /*     0   488 */
/* --- cacheline 7 boundary (448 bytes) was 40 bytes ago --- */
struct iwl_wowlan_igtk_status igtk[2];           /*   488    80 */
/* --- cacheline 8 boundary (512 bytes) was 56 bytes ago --- */
__le64                     replay_ctr;           /*   568     8 */
/* --- cacheline 9 boundary (576 bytes) --- */
__le16                     pattern_number;       /*   576     2 */
__le16                     reserved1;            /*   578     2 */
__le16                     qos_seq_ctr[8];       /*   580    16 */
__le32                     wakeup_reasons;       /*   596     4 */
__le32                     num_of_gtk_rekeys;    /*   600     4 */
__le32                     transmitted_ndps;     /*   604     4 */
__le32                     received_beacons;     /*   608     4 */
u8                         tid_tear_down;        /*   612     1 */
u8                         station_id;           /*   613     1 */
u8                         reserved2[2];         /*   614     2 */

/* size: 616, cachelines: 10, members: 13 */
/* last cacheline: 40 bytes */
};

Therefore, when the following assignments take place, actually no memory
has been allocated for those objects:

2743 notif->tid_tear_down = notif_v1->tid_tear_down;
2744 notif->station_id = notif_v1->station_id;

Fix this by allocating space for the whole notif object and zero out the
remaining space in memory after member station_id.

This also fixes the following -Warray-bounds issues:
CC      drivers/net/wireless/intel/iwlwifi/mvm/d3.o
drivers/net/wireless/intel/iwlwifi/mvm/d3.c: In function ‘iwl_mvm_wait_d3_notif’:
drivers/net/wireless/intel/iwlwifi/mvm/d3.c:2743:30: warning: array subscript ‘struct iwl_wowlan_info_notif[0]’ is partly outside array bounds of ‘unsigned char[612]’ [-Warray-bounds=]
2743 |                         notif->tid_tear_down = notif_v1->tid_tear_down;
      |
                 from drivers/net/wireless/intel/iwlwifi/mvm/d3.c:7:
In function ‘kmemdup’,
    inlined from ‘iwl_mvm_wait_d3_notif’ at drivers/net/wireless/intel/iwlwifi/mvm/d3.c:2735:12:
include/linux/fortify-string.h:765:16: note: object of size 612 allocated by ‘__real_kmemdup’
  765 |         return __real_kmemdup(p, size, gfp);
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/wireless/intel/iwlwifi/mvm/d3.c: In function ‘iwl_mvm_wait_d3_notif’:
drivers/net/wireless/intel/iwlwifi/mvm/d3.c:2744:30: warning: array subscript ‘struct iwl_wowlan_info_notif[0]’ is partly outside array bounds of ‘unsigned char[612]’ [-Warray-bounds=]
2744 |                         notif->station_id = notif_v1->station_id;
      |                              ^~
In function ‘kmemdup’,
    inlined from ‘iwl_mvm_wait_d3_notif’ at drivers/net/wireless/intel/iwlwifi/mvm/d3.c:2735:12:
include/linux/fortify-string.h:765:16: note: object of size 612 allocated by ‘__real_kmemdup’
  765 |         return __real_kmemdup(p, size, gfp);
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

Link: https://github.com/KSPP/linux/issues/306
Fixes: 905d50ddbc83 ("wifi: iwlwifi: mvm: support wowlan info notification version 2")
Cc: [email protected]
Signed-off-by: Gustavo A. R. Silva <[email protected]>
Acked-by: Gregory Greenman <[email protected]>
Link: https://lore.kernel.org/r/ZHpGN555FwAKGduH@work
Signed-off-by: Johannes Berg <[email protected]>

wifi: mac80211: fix switch count in EMA beacons

Currently, whenever an EMA beacon is formed, due to is_template
argument being false from the caller, the switch count is always
decremented once which is wrong.

Also if switch count is equal to profile periodicity, this makes
the switch count to reach till zero which triggers a WARN_ON_ONCE.

[  261.593915] CPU: 1 PID: 800 Comm: kworker/u8:3 Not tainted 5.4.213 #0
[  261.616143] Hardware name: Qualcomm Technologies, Inc. IPQ9574
[  261.622666] Workqueue: phy0 ath12k_get_link_bss_conf [ath12k]
[  261.629771] pstate: 60400005 (nZCv daif +PAN -UAO)
[  261.635595] pc : ieee80211_next_txq+0x1ac/0x1b8 [mac80211]
[  261.640282] lr : ieee80211_beacon_update_cntdwn+0x64/0xb4 [mac80211]
[...]
[  261.729683] Call trace:
[  261.734986]  ieee80211_next_txq+0x1ac/0x1b8 [mac80211]
[  261.737156]  ieee80211_beacon_cntdwn_is_complete+0xa28/0x1194 [mac80211]
[  261.742365]  ieee80211_beacon_cntdwn_is_complete+0xef4/0x1194 [mac80211]
[  261.749224]  ieee80211_beacon_get_template_ema_list+0x38/0x5c [mac80211]
[  261.755908]  ath12k_get_link_bss_conf+0xf8/0x33b4 [ath12k]
[  261.762590]  ath12k_get_link_bss_conf+0x390/0x33b4 [ath12k]
[  261.767881]  process_one_work+0x194/0x270
[  261.773346]  worker_thread+0x200/0x314
[  261.777514]  kthread+0x140/0x150
[  261.781158]  ret_from_fork+0x10/0x18

Fix this issue by making the is_template argument as true when fetching
the EMA beacons.

Fixes: bd54f3c29077 ("wifi: mac80211: generate EMA beacons in AP mode")
Signed-off-by: Aditya Kumar Singh <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johannes Berg <[email protected]>

wifi: mac80211: don't translate beacon/presp addrs

Don't do link address translation for beacons and probe responses,
this leads to reporting multiple scan list entries for the same AP
(one with the MLD address) which just breaks things.

We might need to extend this in the future for some other (action)
frames that aren't MLD addressed.

Fixes: 42fb9148c078 ("wifi: mac80211: do link->MLD address translation on RX")
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Gregory Greenman <[email protected]>
Link: https://lore.kernel.org/r/20230604120651.62adead1b43a.Ifc25eed26ebf3b269f60b1ec10060156d0e7ec0d@changeid
Signed-off-by: Johannes Berg <[email protected]>

wifi: mac80211: mlme: fix non-inheritence element

There were two bugs when creating the non-inheritence
element:
1) 'at_extension' needs to be declared outside the loop,
    otherwise the value resets every iteration and we
    can never really switch properly
2) 'added' never got set to true, so we always cut off
    the extension element again at the end of the function

This shows another issue that we might add a list but no
extension list, but we need to make the extension list a
zero-length one in that case.

Fix all these issues. While at it, add a comment explaining
the trim.

Fixes: 81151ce462e5 ("wifi: mac80211: support MLO authentication/association with one link")
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Gregory Greenman <[email protected]>
Link: https://lore.kernel.org/r/20230604120651.3addaa5c4782.If3a78f9305997ad7ef4ba7ffc17a8234c956f613@changeid
Signed-off-by: Johannes Berg <[email protected]>

wifi: cfg80211: reject bad AP MLD address

When trying to authenticate, if the AP MLD address isn't
a valid address, mac80211 can throw a warning. Avoid that
by rejecting such addresses.

Fixes: d648c23024bd ("wifi: nl80211: support MLO in auth/assoc")
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Gregory Greenman <[email protected]>
Link: https://lore.kernel.org/r/20230604120651.89188912bd1d.I8dbc6c8ee0cb766138803eec59508ef4ce477709@changeid
Signed-off-by: Johannes Berg <[email protected]>

wifi: mac80211: use correct iftype HE cap

We already check that the right iftype capa exists,
but then don't use it. Assign it to a variable so we
can actually use it, and then do that.

Fixes: bac2fd3d7534 ("mac80211: remove use of ieee80211_get_he_sta_cap()")
Signed-off-by: Johannes Berg <[email protected]>
Signed-off-by: Gregory Greenman <[email protected]>
Link: https://lore.kernel.org/r/20230604120651.0e908e5c5fdd.Iac142549a6144ac949ebd116b921a59ae5282735@changeid
Signed-off-by: Johannes Berg <[email protected]>

Bluetooth: L2CAP: Add missing checks for invalid DCID

When receiving a connect response we should make sure that the DCID is
within the valid range and that we don't already have another channel
allocated for the same DCID.
Missing checks may violate the specification (BLUETOOTH CORE SPECIFICATION
Version 5.4 | Vol 3, Part A, Page 1046).

Fixes: 40624183c202 ("Bluetooth: L2CAP: Add missing checks for invalid LE DCID")
Signed-off-by: Sungwoo Kim <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: ISO: use correct CIS order in Set CIG Parameters event

The order of CIS handle array in Set CIG Parameters response shall match
the order of the CIS_ID array in the command (Core v5.3 Vol 4 Part E Sec
7.8.97).  We send CIS_IDs mainly in the order of increasing CIS_ID (but
with "last" CIS first if it has fixed CIG_ID).  In handling of the
reply, we currently assume this is also the same as the order of
hci_conn in hdev->conn_hash, but that is not true.

Match the correct hci_conn to the correct handle by matching them based
on the CIG+CIS combination.  The CIG+CIS combination shall be unique for
ISO_LINK hci_conn at state >= BT_BOUND, which we maintain in
hci_le_set_cig_params.

Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Pauli Virtanen <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: ISO: don't try to remove CIG if there are bound CIS left

Consider existing BOUND & CONNECT state CIS to block CIG removal.
Otherwise, under suitable timing conditions we may attempt to remove CIG
while Create CIS is pending, which fails.

Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Pauli Virtanen <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: Fix l2cap_disconnect_req deadlock

L2CAP assumes that the locks conn->chan_lock and chan->lock are
acquired in the order conn->chan_lock, chan->lock to avoid
potential deadlock.
For example, l2sock_shutdown acquires these locks in the order:
  mutex_lock(&conn->chan_lock)
  l2cap_chan_lock(chan)

However, l2cap_disconnect_req acquires chan->lock in
l2cap_get_chan_by_scid first and then acquires conn->chan_lock
before calling l2cap_chan_del. This means that these locks are
acquired in unexpected order, which leads to potential deadlock:
  l2cap_chan_lock(c)
  mutex_lock(&conn->chan_lock)

This patch releases chan->lock before acquiring the conn_chan_lock
to avoid the potential deadlock.

Fixes: a2a9339e1c9d ("Bluetooth: L2CAP: Fix use-after-free in l2cap_disconnect_{req,rsp}")
Signed-off-by: Ying Hsu <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: hci_qca: fix debugfs registration

Since commit 3e4be65eb82c ("Bluetooth: hci_qca: Add poweroff support
during hci down for wcn3990"), the setup callback which registers the
debugfs interface can be called multiple times.

This specifically leads to the following error when powering on the
controller:

debugfs: Directory 'ibs' with parent 'hci0' already present!

Add a driver flag to avoid trying to register the debugfs interface more
than once.

Fixes: 3e4be65eb82c ("Bluetooth: hci_qca: Add poweroff support during hci down for wcn3990")
Cc: [email protected] # 4.20
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: fix debugfs registration

Since commit ec6cef9cd98d ("Bluetooth: Fix SMP channel registration for
unconfigured controllers") the debugfs interface for unconfigured
controllers will be created when the controller is configured.

There is however currently nothing preventing a controller from being
configured multiple time (e.g. setting the device address using btmgmt)
which results in failed attempts to register the already registered
debugfs entries:

debugfs: File 'features' in directory 'hci0' already present!
debugfs: File 'manufacturer' in directory 'hci0' already present!
debugfs: File 'hci_version' in directory 'hci0' already present!
...
debugfs: File 'quirk_simultaneous_discovery' in directory 'hci0' already present!

Add a controller flag to avoid trying to register the debugfs interface
more than once.

Fixes: ec6cef9cd98d ("Bluetooth: Fix SMP channel registration for unconfigured controllers")
Cc: [email protected] # 4.0
Signed-off-by: Johan Hovold <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: hci_sync: add lock to protect HCI_UNREGISTER

When the HCI_UNREGISTER flag is set, no jobs should be scheduled. Fix
potential race when HCI_UNREGISTER is set after the flag is tested in
hci_cmd_sync_queue.

Fixes: 0b94f2651f56 ("Bluetooth: hci_sync: Fix queuing commands when HCI_UNREGISTER is set")
Signed-off-by: Zhengping Jiang <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: Fix use-after-free in hci_remove_ltk/hci_remove_irk

Similar to commit 0f7d9b31ce7a ("netfilter: nf_tables: fix use-after-free
in nft_set_catchall_destroy()"). We can not access k after kfree_rcu()
call.

Cc: [email protected]
Signed-off-by: Min Li <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: ISO: Fix CIG auto-allocation to select configurable CIG

Make CIG auto-allocation to select the first CIG_ID that is still
configurable. Also use correct CIG_ID range (see Core v5.3 Vol 4 Part E
Sec 7.8.97 p.2553).

Previously, it would always select CIG_ID 0 regardless of anything,
because cis_list with data.cis == 0xff (BT_ISO_QOS_CIS_UNSET) would not
count any CIS. Since we are not adding CIS here, use find_cis instead.

Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Pauli Virtanen <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Bluetooth: ISO: consider right CIS when removing CIG at cleanup

When looking for CIS blocking CIG removal, consider only the CIS with
the right CIG ID. Don't try to remove CIG with unset CIG ID.

Fixes: 26afbd826ee3 ("Bluetooth: Add initial implementation of CIS connections")
Signed-off-by: Pauli Virtanen <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>

bpf: netfilter: Add BPF_NETFILTER bpf_attach_type

Andrii Nakryiko writes:

And we currently don't have an attach type for NETLINK BPF link.
Thankfully it's not too late to add it. I see that link_create() in
kernel/bpf/syscall.c just bypasses attach_type check. We shouldn't
have done that. Instead we need to add BPF_NETLINK attach type to enum
bpf_attach_type. And wire all that properly throughout the kernel and
libbpf itself.

This adds BPF_NETFILTER and uses it. This breaks uabi but this
wasn't in any non-rc release yet, so it should be fine.

v2: check link_attack prog type in link_create too

Fixes: 84601d6ee68a ("bpf: add bpf_link support for BPF_NETFILTER programs")
Suggested-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/CAEf4BzZ69YgrQW7DHCJUT_X+GqMq_ZQQPBwopaJJVGFD5=d5Vg@mail.gmail.com/
Link: https://lore.kernel.org/bpf/[email protected]

HID: hidpp: terminate retry loop on success

It seems we forgot the normal case to terminate the retry loop,
making us asking 3 times each command, which is probably a little bit
too much.

And remove the ugly "goto exit" that can be replaced by a simpler "break"

Fixes: 586e8fede795 ("HID: logitech-hidpp: Retry commands when device is busy")
Suggested-by: Mark Lord <[email protected]>
Tested-by: Mark Lord <[email protected]>
Signed-off-by: Benjamin Tissoires <[email protected]>
Signed-off-by: Jiri Kosina <[email protected]>

Merge tag 'asym-keys-fix-for-linus-v6.4-rc5' of https://github.com/robertosassu/linux

Pull asymmetric keys fix from Roberto Sassu:
"Here is a small fix to make an unconditional copy of the buffer passed
  to crypto operations, to take into account the case of the stack not
  in the linear mapping area.

  It has been tested and verified to fix the bug"

Acked-by: Herbert Xu <[email protected]>
Acked-by: David Howells <[email protected]>
* tag 'asym-keys-fix-for-linus-v6.4-rc5' of https://github.com/robertosassu/linux:
  KEYS: asymmetric: Copy sig and digest in public_key_verify_signature()

Merge branch 'mptcp-addr-adv-fixes'

Mat Martineau says:

====================
mptcp: Fixes for address advertisement

Patches 1 and 2 allow address advertisements to be removed without
affecting current connected subflows, and updates associated self tests.

Patches 3 and 4 correctly track (and allow removal of) addresses that
were implicitly announced as part of subflow creation. Also updates
associated self tests.

Patch 5 makes subflow and address announcement counters work consistently
between the userspace and in-kernel path managers.
====================

Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: update userspace pm infos

Increase pm subflows counter on both server side and client side when
userspace pm creates a new subflow, and decrease the counter when it
closes a subflow.

Increase add_addr_signaled counter in mptcp_nl_cmd_announce() when the
address is announced by userspace PM.

This modification is similar to how the in-kernel PM is updating the
counter: when additional subflows are created/removed.

Fixes: 9ab4807c84a4 ("mptcp: netlink: Add MPTCP_PM_CMD_ANNOUNCE")
Fixes: 702c2f646d42 ("mptcp: netlink: allow userspace-driven subflow establishment")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/329
Cc: [email protected]
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests: mptcp: update userspace pm subflow tests

To align with what is done by the in-kernel PM, update userspace pm
subflow selftests, by sending the a remove_addrs command together
before the remove_subflows command. This will get a RM_ADDR in
chk_rm_nr().

Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Fixes: 5e986ec46874 ("selftests: mptcp: userspace pm subflow tests")
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/379
Cc: [email protected]
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: add address into userspace pm list

Add the address into userspace_pm_local_addr_list when the subflow is
created. Make sure it can be found in mptcp_nl_cmd_remove(). And delete
it in the new helper mptcp_userspace_pm_delete_local_addr().

By doing this, the "REMOVE" command also works with subflows that have
been created via the "SUB_CREATE" command instead of restricting to
the addresses that have been announced via the "ANNOUNCE" command.

Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Link: https://github.com/multipath-tcp/mptcp_net-next/issues/379
Cc: [email protected]
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

selftests: mptcp: update userspace pm addr tests

This patch is linked to the previous commit ("mptcp: only send RM_ADDR in
nl_cmd_remove").

To align with what is done by the in-kernel PM, update userspace pm addr
selftests, by sending a remove_subflows command together after the
remove_addrs command.

Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Fixes: 97040cf9806e ("selftests: mptcp: userspace pm address tests")
Cc: [email protected]
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

mptcp: only send RM_ADDR in nl_cmd_remove

The specifications from [1] about the "REMOVE" command say:

Announce that an address has been lost to the peer

It was then only supposed to send a RM_ADDR and not trying to delete
associated subflows.

A new helper mptcp_pm_remove_addrs() is then introduced to do just
that, compared to mptcp_pm_remove_addrs_and_subflows() also removing
subflows.

To delete a subflow, the userspace daemon can use the "SUB_DESTROY"
command, see mptcp_nl_cmd_sf_destroy().

Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Link: https://github.com/multipath-tcp/mptcp/blob/mptcp_v0.96/include/uapi/linux/mptcp.h
Cc: [email protected]
Reviewed-by: Matthieu Baerts <[email protected]>
Signed-off-by: Geliang Tang <[email protected]>
Signed-off-by: Mat Martineau <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: stmmac: dwmac-qcom-ethqos: fix a regression on EMAC < 3

We must not assign plat_dat->dwmac4_addrs unconditionally as for
structures which don't set them, this will result in the core driver
using zeroes everywhere and breaking the driver for older HW. On EMAC < 2
the address should remain NULL.

Fixes: b68376191c69 ("net: stmmac: dwmac-qcom-ethqos: Add EMAC3 support")
Signed-off-by: Bartosz Golaszewski <[email protected]>
Reviewed-by: Andrew Halaney <[email protected]>
Reviewed-by: Siddharth Vadapalli <[email protected]>
Reviewed-by: Vinod Koul <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'linux-can-fixes-for-6.4-20230605' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================

this is a pull request of 3 patches for net/master.

All 3 patches target the j1939 stack.

The 1st patch is by Oleksij Rempel and fixes the error queue handling
for (E)TP sessions that run into timeouts.

The last 2 patches are by Fedor Pchelkin and fix a potential
use-after-free in j1939_netdev_start() if j1939_can_rx_register()
fails.
====================

Signed-off-by: David S. Miller <[email protected]>

net/sched: fq_pie: ensure reasonable TCA_FQ_PIE_QUANTUM values

We got multiple syzbot reports, all duplicates of the following [1]

syzbot managed to install fq_pie with a zero TCA_FQ_PIE_QUANTUM,
thus triggering infinite loops.

Use limits similar to sch_fq, with commits
3725a269815b ("pkt_sched: fq: avoid hang when quantum 0") and
d9e15a273306 ("pkt_sched: fq: do not accept silly TCA_FQ_QUANTUM")

[1]
watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
Modules linked in:
irq event stamp: 172817
hardirqs last enabled at (172816): [<ffff80001242fde4>] __el1_irq arch/arm64/kernel/entry-common.c:476 [inline]
hardirqs last enabled at (172816): [<ffff80001242fde4>] el1_interrupt+0x58/0x68 arch/arm64/kernel/entry-common.c:486
hardirqs last disabled at (172817): [<ffff80001242fdb0>] __el1_irq arch/arm64/kernel/entry-common.c:468 [inline]
hardirqs last disabled at (172817): [<ffff80001242fdb0>] el1_interrupt+0x24/0x68 arch/arm64/kernel/entry-common.c:486
softirqs last enabled at (167634): [<ffff800008020c1c>] softirq_handle_end kernel/softirq.c:414 [inline]
softirqs last enabled at (167634): [<ffff800008020c1c>] __do_softirq+0xac0/0xd54 kernel/softirq.c:600
softirqs last disabled at (167701): [<ffff80000802a660>] ____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:80
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0-rc3-syzkaller-geb0f1697d729 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/28/2023
pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : fq_pie_qdisc_dequeue+0x10c/0x8ac net/sched/sch_fq_pie.c:246
lr : fq_pie_qdisc_dequeue+0xe4/0x8ac net/sched/sch_fq_pie.c:240
sp : ffff800008007210
x29: ffff800008007280 x28: ffff0000c86f7890 x27: ffff0000cb20c2e8
x26: ffff0000cb20c2f0 x25: dfff800000000000 x24: ffff0000cb20c2e0
x23: ffff0000c86f7880 x22: 0000000000000040 x21: 1fffe000190def10
x20: ffff0000cb20c2e0 x19: ffff0000cb20c2e0 x18: ffff800008006e60
x17: 0000000000000000 x16: ffff80000850af6c x15: 0000000000000302
x14: 0000000000000100 x13: 0000000000000000 x12: 0000000000000001
x11: 0000000000000302 x10: 0000000000000100 x9 : 0000000000000000
x8 : 0000000000000000 x7 : ffff80000841c468 x6 : 0000000000000000
x5 : 0000000000000001 x4 : 0000000000000001 x3 : 0000000000000000
x2 : ffff0000cb20c2e0 x1 : ffff0000cb20c2e0 x0 : 0000000000000001
Call trace:
fq_pie_qdisc_dequeue+0x10c/0x8ac net/sched/sch_fq_pie.c:246
dequeue_skb net/sched/sch_generic.c:292 [inline]
qdisc_restart net/sched/sch_generic.c:397 [inline]
__qdisc_run+0x1fc/0x231c net/sched/sch_generic.c:415
__dev_xmit_skb net/core/dev.c:3868 [inline]
__dev_queue_xmit+0xc80/0x3318 net/core/dev.c:4210
dev_queue_xmit include/linux/netdevice.h:3085 [inline]
neigh_connected_output+0x2f8/0x38c net/core/neighbour.c:1581
neigh_output include/net/neighbour.h:544 [inline]
ip6_finish_output2+0xd60/0x1a1c net/ipv6/ip6_output.c:134
__ip6_finish_output net/ipv6/ip6_output.c:195 [inline]
ip6_finish_output+0x538/0x8c8 net/ipv6/ip6_output.c:206
NF_HOOK_COND include/linux/netfilter.h:292 [inline]
ip6_output+0x270/0x594 net/ipv6/ip6_output.c:227
dst_output include/net/dst.h:458 [inline]
NF_HOOK include/linux/netfilter.h:303 [inline]
ndisc_send_skb+0xc30/0x1790 net/ipv6/ndisc.c:508
ndisc_send_rs+0x47c/0x5d4 net/ipv6/ndisc.c:718
addrconf_rs_timer+0x300/0x58c net/ipv6/addrconf.c:3936
call_timer_fn+0x19c/0x8cc kernel/time/timer.c:1700
expire_timers kernel/time/timer.c:1751 [inline]
__run_timers+0x55c/0x734 kernel/time/timer.c:2022
run_timer_softirq+0x7c/0x114 kernel/time/timer.c:2035
__do_softirq+0x2d0/0xd54 kernel/softirq.c:571
____do_softirq+0x14/0x20 arch/arm64/kernel/irq.c:80
call_on_irq_stack+0x24/0x4c arch/arm64/kernel/entry.S:882
do_softirq_own_stack+0x20/0x2c arch/arm64/kernel/irq.c:85
invoke_softirq kernel/softirq.c:452 [inline]
__irq_exit_rcu+0x28c/0x534 kernel/softirq.c:650
irq_exit_rcu+0x14/0x84 kernel/softirq.c:662
__el1_irq arch/arm64/kernel/entry-common.c:472 [inline]
el1_interrupt+0x38/0x68 arch/arm64/kernel/entry-common.c:486
el1h_64_irq_handler+0x18/0x24 arch/arm64/kernel/entry-common.c:491
el1h_64_irq+0x64/0x68 arch/arm64/kernel/entry.S:587
__daif_local_irq_enable arch/arm64/include/asm/irqflags.h:33 [inline]
arch_local_irq_enable+0x8/0xc arch/arm64/include/asm/irqflags.h:55
cpuidle_idle_call kernel/sched/idle.c:170 [inline]
do_idle+0x1f0/0x4e8 kernel/sched/idle.c:282
cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:379
rest_init+0x2dc/0x2f4 init/main.c:735
start_kernel+0x0/0x55c init/main.c:834
start_kernel+0x3f0/0x55c init/main.c:1088
__primary_switched+0xb8/0xc0 arch/arm64/kernel/head.S:523

Fixes: ec97ecf1ebe4 ("net: sched: add Flow Queue PIE packet scheduler")
Reported-by: syzbot <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge patch series "can: j1939: avoid possible use-after-free when j1939_can_rx_register fails"

Fedor Pchelkin <[email protected]> says:

The patch series fixes a possible racy use-after-free scenario
described in 2/2: if j1939_can_rx_register() fails then the concurrent
thread may have already read the invalid priv structure.

The 1/2 makes j1939_netdev_lock a mutex so that access to
j1939_can_rx_register() can be serialized without changing GFP_KERNEL
to GFP_ATOMIC inside can_rx_register(). This seems to be safe.

Note that the patch series has been tested only via Syzkaller and not
with a real device.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>

can: j1939: avoid possible use-after-free when j1939_can_rx_register fails

Syzkaller reports the following failure:

BUG: KASAN: use-after-free in kref_put include/linux/kref.h:64 [inline]
BUG: KASAN: use-after-free in j1939_priv_put+0x25/0xa0 net/can/j1939/main.c:172
Write of size 4 at addr ffff888141c15058 by task swapper/3/0

CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.10.144-syzkaller #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
<IRQ>
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x107/0x167 lib/dump_stack.c:118
print_address_description.constprop.0+0x1c/0x220 mm/kasan/report.c:385
__kasan_report mm/kasan/report.c:545 [inline]
kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
check_memory_region_inline mm/kasan/generic.c:186 [inline]
check_memory_region+0x145/0x190 mm/kasan/generic.c:192
instrument_atomic_read_write include/linux/instrumented.h:101 [inline]
atomic_fetch_sub_release include/asm-generic/atomic-instrumented.h:220 [inline]
__refcount_sub_and_test include/linux/refcount.h:272 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
kref_put include/linux/kref.h:64 [inline]
j1939_priv_put+0x25/0xa0 net/can/j1939/main.c:172
j1939_sk_sock_destruct+0x44/0x90 net/can/j1939/socket.c:374
__sk_destruct+0x4e/0x820 net/core/sock.c:1784
rcu_do_batch kernel/rcu/tree.c:2485 [inline]
rcu_core+0xb35/0x1a30 kernel/rcu/tree.c:2726
__do_softirq+0x289/0x9a3 kernel/softirq.c:298
asm_call_irq_on_stack+0x12/0x20
</IRQ>
__run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
do_softirq_own_stack+0xaa/0xe0 arch/x86/kernel/irq_64.c:77
invoke_softirq kernel/softirq.c:393 [inline]
__irq_exit_rcu kernel/softirq.c:423 [inline]
irq_exit_rcu+0x136/0x200 kernel/softirq.c:435
sysvec_apic_timer_interrupt+0x4d/0x100 arch/x86/kernel/apic/apic.c:1095
asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:635

Allocated by task 1141:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track mm/kasan/common.c:56 [inline]
__kasan_kmalloc.constprop.0+0xc9/0xd0 mm/kasan/common.c:461
kmalloc include/linux/slab.h:552 [inline]
kzalloc include/linux/slab.h:664 [inline]
j1939_priv_create net/can/j1939/main.c:131 [inline]
j1939_netdev_start+0x111/0x860 net/can/j1939/main.c:268
j1939_sk_bind+0x8ea/0xd30 net/can/j1939/socket.c:485
__sys_bind+0x1f2/0x260 net/socket.c:1645
__do_sys_bind net/socket.c:1656 [inline]
__se_sys_bind net/socket.c:1654 [inline]
__x64_sys_bind+0x6f/0xb0 net/socket.c:1654
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x61/0xc6

Freed by task 1141:
kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
__kasan_slab_free+0x112/0x170 mm/kasan/common.c:422
slab_free_hook mm/slub.c:1542 [inline]
slab_free_freelist_hook+0xad/0x190 mm/slub.c:1576
slab_free mm/slub.c:3149 [inline]
kfree+0xd9/0x3b0 mm/slub.c:4125
j1939_netdev_start+0x5ee/0x860 net/can/j1939/main.c:300
j1939_sk_bind+0x8ea/0xd30 net/can/j1939/socket.c:485
__sys_bind+0x1f2/0x260 net/socket.c:1645
__do_sys_bind net/socket.c:1656 [inline]
__se_sys_bind net/socket.c:1654 [inline]
__x64_sys_bind+0x6f/0xb0 net/socket.c:1654
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x61/0xc6

It can be caused by this scenario:

CPU0 CPU1
j1939_sk_bind(socket0, ndev0, ...)
  j1939_netdev_start()
j1939_sk_bind(socket1, ndev0, ...)
                                          j1939_netdev_start()
  mutex_lock(&j1939_netdev_lock)
  j1939_priv_set(ndev0, priv)
  mutex_unlock(&j1939_netdev_lock)
  if (priv_new)
    kref_get(&priv_new->rx_kref)
    return priv_new;
  /* inside j1939_sk_bind() */
  jsk->priv = priv
  j1939_can_rx_register(priv) // fails
  j1939_priv_set(ndev, NULL)
  kfree(priv)
j1939_sk_sock_destruct()
j1939_priv_put() // <- uaf

To avoid this, call j1939_can_rx_register() under j1939_netdev_lock so
that a concurrent thread cannot process j1939_priv before
j1939_can_rx_register() returns.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Signed-off-by: Fedor Pchelkin <[email protected]>
Tested-by: Oleksij Rempel <[email protected]>
Acked-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>

can: j1939: change j1939_netdev_lock type to mutex

It turns out access to j1939_can_rx_register() needs to be serialized,
otherwise j1939_priv can be corrupted when parallel threads call
j1939_netdev_start() and j1939_can_rx_register() fails. This issue is
thoroughly covered in other commit which serializes access to
j1939_can_rx_register().

Change j1939_netdev_lock type to mutex so that we do not need to remove
GFP_KERNEL from can_rx_register().

j1939_netdev_lock seems to be used in normal contexts where mutex usage
is not prohibited.

Found by Linux Verification Center (linuxtesting.org) with Syzkaller.

Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Suggested-by: Alexey Khoroshilov <[email protected]>
Signed-off-by: Fedor Pchelkin <[email protected]>
Tested-by: Oleksij Rempel <[email protected]>
Acked-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>

can: j1939: j1939_sk_send_loop_abort(): improved error queue handling in J1939 Socket

This patch addresses an issue within the j1939_sk_send_loop_abort()
function in the j1939/socket.c file, specifically in the context of
Transport Protocol (TP) sessions.

Without this patch, when a TP session is initiated and a Clear To Send
(CTS) frame is received from the remote side requesting one data packet,
the kernel dispatches the first Data Transport (DT) frame and then waits
for the next CTS. If the remote side doesn't respond with another CTS,
the kernel aborts due to a timeout. This leads to the user-space
receiving an EPOLLERR on the socket, and the socket becomes active.

However, when trying to read the error queue from the socket with
sock.recvmsg(, , socket.MSG_ERRQUEUE), it returns -EAGAIN,
given that the socket is non-blocking. This situation results in an
infinite loop: the user-space repeatedly calls epoll(), epoll() returns
the socket file descriptor with EPOLLERR, but the socket then blocks on
the recv() of ERRQUEUE.

This patch introduces an additional check for the J1939_SOCK_ERRQUEUE
flag within the j1939_sk_send_loop_abort() function. If the flag is set,
it indicates that the application has subscribed to receive error queue
messages. In such cases, the kernel can communicate the current transfer
state via the error queue. This allows for the function to return early,
preventing the unnecessary setting of the socket into an error state,
and breaking the infinite loop. It is crucial to note that a socket
error is only needed if the application isn't using the error queue, as,
without it, the application wouldn't be aware of transfer issues.

Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Reported-by: David Jander <[email protected]>
Tested-by: David Jander <[email protected]>
Signed-off-by: Oleksij Rempel <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Marc Kleine-Budde <[email protected]>

xfs: collect errors from inodegc for unlinked inode recovery

Unlinked list recovery requires errors removing the inode the from
the unlinked list get fed back to the main recovery loop. Now that
we offload the unlinking to the inodegc work, we don't get errors
being fed back when we trip over a corruption that prevents the
inode from being removed from the unlinked list.

This means we never clear the corrupt unlinked list bucket,
resulting in runtime operations eventually tripping over it and
shutting down.

Fix this by collecting inodegc worker errors and feed them
back to the flush caller. This is largely best effort - the only
context that really cares is log recovery, and it only flushes a
single inode at a time so we don't need complex synchronised
handling. Essentially the inodegc workers will capture the first
error that occurs and the next flush will gather them and clear
them. The flush itself will only report the first gathered error.

In the cases where callers can return errors, propagate the
collected inodegc flush error up the error handling chain.

In the case of inode unlinked list recovery, there are several
superfluous calls to flush queued unlinked inodes -
xlog_recover_iunlink_bucket() guarantees that it has flushed the
inodegc and collected errors before it returns. Hence nothing in the
calling path needs to run a flush, even when an error is returned.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: validate block number being freed before adding to xefi

Bad things happen in defered extent freeing operations if it is
passed a bad block number in the xefi. This can come from a bogus
agno/agbno pair from deferred agfl freeing, or just a bad fsbno
being passed to __xfs_free_extent_later(). Either way, it's very
difficult to diagnose where a null perag oops in EFI creation
is coming from when the operation that queued the xefi has already
been completed and there's no longer any trace of it around....

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: validity check agbnos on the AGFL

If the agfl or the indexing in the AGF has been corrupted, getting a
block form the AGFL could return an invalid block number. If this
happens, bad things happen. Check the agbno we pull off the AGFL
and return -EFSCORRUPTED if we find somethign bad.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: fix agf/agfl verification on v4 filesystems

When a v4 filesystem has fl_last - fl_first != fl_count, we do not
not detect the corruption and allow the AGF to be used as it if was
fully valid. On V5 filesystems, we reset the AGFL to empty in these
cases and avoid the corruption at a small cost of leaked blocks.

If we don't catch the corruption on V4 filesystems, bad things
happen later when an allocation attempts to trim the free list
and either double-frees stale entries in the AGFl or tries to free
NULLAGBNO entries.

Either way, this is bad. Prevent this from happening by using the
AGFL_NEED_RESET logic for v4 filesysetms, too.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: fix double xfs_perag_rele() in xfs_filestream_pick_ag()

xfs_bmap_longest_free_extent() can return an error when accessing
the AGF fails. In this case, the behaviour of
xfs_filestream_pick_ag() is conditional on the error. We may
continue the loop, or break out of it. The error handling after the
loop cleans up the perag reference held when the break occurs. If we
continue, the next loop iteration handles cleaning up the perag
reference.

EIther way, we don't need to release the active perag reference when
xfs_bmap_longest_free_extent() fails. Doing so means we do a double
decrement on the active reference count, and this causes tha active
reference count to fall to zero. At this point, new active
references will fail.

This leads to unmount hanging because it tries to grab active
references to that perag, only for it to fail. This happens inside a
loop that retries until a inode tree radix tree tag is cleared,
which cannot happen because we can't get an active reference to the
perag.

The unmount livelocks in this path:

  xfs_reclaim_inodes+0x80/0xc0
  xfs_unmount_flush_inodes+0x5b/0x70
  xfs_unmountfs+0x5b/0x1a0
  xfs_fs_put_super+0x49/0x110
  generic_shutdown_super+0x7c/0x1a0
  kill_block_super+0x27/0x50
  deactivate_locked_super+0x30/0x90
  deactivate_super+0x3c/0x50
  cleanup_mnt+0xc2/0x160
  __cleanup_mnt+0x12/0x20
  task_work_run+0x5e/0xa0
  exit_to_user_mode_prepare+0x1bc/0x1c0
  syscall_exit_to_user_mode+0x16/0x40
  do_syscall_64+0x40/0x80
  entry_SYSCALL_64_after_hwframe+0x63/0xcd

Reported-by: Pengfei Xu <[email protected]>
Fixes: eb70aa2d8ed9 ("xfs: use for_each_perag_wrap in xfs_filestream_pick_ag")
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: fix broken logic when detecting mergeable bmap records

Commit 6bc6c99a944c was a well-intentioned effort to initiate
consolidation of adjacent bmbt mapping records by setting the PREEN
flag.  Consolidation can only happen if the length of the combined
record doesn't overflow the 21-bit blockcount field of the bmbt
recordset.  Unfortunately, the length test is inverted, leading to it
triggering on data forks like these:

EXT: FILE-OFFSET           BLOCK-RANGE        AG AG-OFFSET               TOTAL
   0: [0..16777207]:        76110848..92888055  0 (76110848..92888055) 16777208
   1: [16777208..20639743]: 92888056..96750591  0 (92888056..96750591)  3862536

Note that record 0 has a length of 16777208 512b blocks.  This
corresponds to 2097151 4k fsblocks, which is the maximum.  Hence the two
records cannot be merged.

However, the logic is still wrong even if we change the in-loop
comparison, because the scope of our examination isn't broad enough
inside the loop to detect mappings like this:

   0: [0..9]:               76110838..76110847  0 (76110838..76110847)       10
   1: [10..16777217]:       76110848..92888055  0 (76110848..92888055) 16777208
   2: [16777218..20639753]: 92888056..96750591  0 (92888056..96750591)  3862536

These three records could be merged into two, but one cannot determine
this purely from looking at records 0-1 or 1-2 in isolation.

Hoist the mergability detection outside the loop, and base its decision
making on whether or not a merged mapping could be expressed in fewer
bmbt records.  While we're at it, fix the incorrect return type of the
iter function.

Fixes: 336642f79283 ("xfs: alert the user about data/attr fork mappings that could be merged")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: Fix undefined behavior of shift into sign bit

With gcc-5:

    In file included from ./include/trace/define_trace.h:102:0,
     from ./fs/xfs/scrub/trace.h:988,
     from fs/xfs/scrub/trace.c:40:
    ./fs/xfs/./scrub/trace.h: In function ‘trace_raw_output_xchk_fsgate_class’:
    ./fs/xfs/scrub/scrub.h:111:28: error: initializer element is not constant
     #define XREP_ALREADY_FIXED (1 << 31) /* checking our repair work */
^

Shifting the (signed) value 1 into the sign bit is undefined behavior.

Fix this for all definitions in the file by shifting "1U" instead of
"1".

This was exposed by the first user added in commit 466c525d6d35e691
("xfs: minimize overhead of drain wakeups by using jump labels").

Fixes: 160b5a784525e8a4 ("xfs: hoist the already_fixed variable to the scrub context")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: fix AGF vs inode cluster buffer deadlock

Lock order in XFS is AGI -> AGF, hence for operations involving
inode unlinked list operations we always lock the AGI first. Inode
unlinked list operations operate on the inode cluster buffer,
so the lock order there is AGI -> inode cluster buffer.

For O_TMPFILE operations, this now means the lock order set down in
xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
unlinked ops are done before the directory modifications that may
allocate space and lock the AGF.

Unfortunately, we also now lock the inode cluster buffer when
logging an inode so that we can attach the inode to the cluster
buffer and pin it in memory. This creates a lock order of AGF ->
inode cluster buffer in directory operations as we have to log the
inode after we've allocated new space for it.

This creates a lock inversion between the AGF and the inode cluster
buffer. Because the inode cluster buffer is shared across multiple
inodes, the inversion is not specific to individual inodes but can
occur when inodes in the same cluster buffer are accessed in
different orders.

To fix this we need move all the inode log item cluster buffer
interactions to the end of the current transaction. Unfortunately,
xfs_trans_log_inode() calls are littered throughout the transactions
with no thought to ordering against other items or locking. This
makes it difficult to do anything that involves changing the call
sites of xfs_trans_log_inode() to change locking orders.

However, we do now have a mechanism that allows is to postpone dirty
item processing to just before we commit the transaction: the
->iop_precommit method. This will be called after all the
modifications are done and high level objects like AGI and AGF
buffers have been locked and modified, thereby providing a mechanism
that guarantees we don't lock the inode cluster buffer before those
high level objects are locked.

This change is largely moving the guts of xfs_trans_log_inode() to
xfs_inode_item_precommit() and providing an extra flag context in
the inode log item to track the dirty state of the inode in the
current transaction. This also means we do a lot less repeated work
in xfs_trans_log_inode() by only doing it once per transaction when
all the work is done.

Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: defered work could create precommits

To fix a AGI-AGF-inode cluster buffer deadlock, we need to move
inode cluster buffer operations to the ->iop_precommit() method.
However, this means that deferred operations can require precommits
to be run on the final transaction that the deferred ops pass back
to xfs_trans_commit() context. This will be exposed by attribute
handling, in that the last changes to the inode in the attr set
state machine "disappear" because the precommit operation is not run.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: restore allocation trylock iteration

It was accidentally dropped when refactoring the allocation code,
resulting in the AG iteration always doing blocking AG iteration.
This results in a small performance regression for a specific fsmark
test that runs more user data writer threads than there are AGs.

Reported-by: kernel test robot <[email protected]>
Fixes: 2edf06a50f5b ("xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()")
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

xfs: buffer pins need to hold a buffer reference

When a buffer is unpinned by xfs_buf_item_unpin(), we need to access
the buffer after we've dropped the buffer log item reference count.
This opens a window where we can have two racing unpins for the
buffer item (e.g. shutdown checkpoint context callback processing
racing with journal IO iclog completion processing) and both attempt
to access the buffer after dropping the BLI reference count. If we
are unlucky, the "BLI freed" context wins the race and frees the
buffer before the "BLI still active" case checks the buffer pin
count.

This results in a use after free that can only be triggered
in active filesystem shutdown situations.

To fix this, we need to ensure that buffer existence extends beyond
the BLI reference count checks and until the unpin processing is
complete. This implies that a buffer pin operation must also take a
buffer reference to ensure that the buffer cannot be freed until the
buffer unpin processing is complete.

Reported-by: yangerkun <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

Linux 6.4-rc5

Merge tag 'irq_urgent_for_v6.4_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Borislav Petkov:

- Fix open firmware quirks validation so that they don't get applied
wrongly

* tag 'irq_urgent_for_v6.4_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic: Correctly validate OF quirk descriptors

net: sched: wrap tc_skip_wrapper with CONFIG_RETPOLINE

This patch fixes the following sparse warning:

net/sched/sch_api.c:2305:1: sparse: warning: symbol 'tc_skip_wrapper' was not declared. Should it be static?

No functional change intended.

Signed-off-by: Min-Hua Chen <[email protected]>
Acked-by: Pedro Tammela <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge branch 'enetc-fixes'

Wei Fang says:

====================
net: enetc: correct the statistics of rx bytes

The purpose of this patch set is to fix the issue of rx bytes
statistics. The first patch corrects the rx bytes statistics
of normal kernel protocol stack path, and the second patch is
used to correct the rx bytes statistics of XDP.
====================

Signed-off-by: David S. Miller <[email protected]>

net: enetc: correct rx_bytes statistics of XDP

The rx_bytes statistics of XDP are always zero, because rx_byte_cnt
is not updated after it is initialized to 0. So fix it.

Fixes: d1b15102dd16 ("net: enetc: add support for XDP_DROP and XDP_PASS")
Signed-off-by: Wei Fang <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

net: enetc: correct the statistics of rx bytes

The rx_bytes of struct net_device_stats should count the length of
ethernet frames excluding the FCS. However, there are two problems
with the rx_bytes statistics of the current enetc driver. one is
that the length of VLAN header is not counted if the VLAN extraction
feature is enabled. The other is that the length of L2 header is not
counted, because eth_type_trans() is invoked before updating rx_bytes
which will subtract the length of L2 header from skb->len.
BTW, the rx_bytes statistics of XDP path also have similar problem,
I will fix it in another patch.

Fixes: a800abd3ecb9 ("net: enetc: move skb creation into enetc_build_skb")
Signed-off-by: Wei Fang <[email protected]>
Reviewed-by: Vladimir Oltean <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'media/v6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Pull media fixes from Mauro Carvalho Chehab:
"Some driver fixes:
   - a regression fix for the verisilicon driver
   - uvcvideo: don't expose unsupported video formats to userspace
   - camss-video: don't zero subdev format after init
   - mediatek: some fixes for 4K decoder formats
   - fix a Sphinx build warning (missing doc for client_caps)
   - some fixes for imx and atomisp staging drivers

  And two CEC core fixes:
   - don't set last_initiator if TX in progress
   - disable adapter in cec_devnode_unregister"

* tag 'media/v6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: uvcvideo: Don't expose unsupported formats to userspace
  media: v4l2-subdev: Fix missing kerneldoc for client_caps
  media: staging: media: imx: initialize hs_settle to avoid warning
  media: v4l2-mc: Drop subdev check in v4l2_create_fwnode_links_to_pad()
  media: staging: media: atomisp: init high & low vars
  media: cec: core: don't set last_initiator if tx in progress
  media: cec: core: disable adapter in cec_devnode_unregister
  media: mediatek: vcodec: Only apply 4K frame sizes on decoder formats
  media: camss: camss-video: Don't zero subdev format again after initialization
  media: verisilicon: Additional fix for the crash when opening the driver

Merge tag 'char-misc-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
"Here are a bunch of tiny char/misc/other driver fixes for 6.4-rc5 that
  resolve a number of reported issues. Included in here are:

   - iio driver fixes

   - fpga driver fixes

   - test_firmware bugfixes

   - fastrpc driver tiny bugfixes

   - MAINTAINERS file updates for some subsystems

  All of these have been in linux-next this past week with no reported
  issues"

* tag 'char-misc-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (34 commits)
  test_firmware: fix the memory leak of the allocated firmware buffer
  test_firmware: fix a memory leak with reqs buffer
  test_firmware: prevent race conditions by a correct implementation of locking
  firmware_loader: Fix a NULL vs IS_ERR() check
  MAINTAINERS: Vaibhav Gupta is the new ipack maintainer
  dt-bindings: fpga: replace Ivan Bornyakov maintainership
  MAINTAINERS: update Microchip MPF FPGA reviewers
  misc: fastrpc: reject new invocations during device removal
  misc: fastrpc: return -EPIPE to invocations on device removal
  misc: fastrpc: Reassign memory ownership only for remote heap
  misc: fastrpc: Pass proper scm arguments for secure map request
  iio: imu: inv_icm42600: fix timestamp reset
  iio: adc: ad_sigma_delta: Fix IRQ issue by setting IRQ_DISABLE_UNLAZY flag
  dt-bindings: iio: adc: renesas,rcar-gyroadc: Fix adi,ad7476 compatible value
  iio: dac: mcp4725: Fix i2c_master_send() return value handling
  iio: accel: kx022a fix irq getting
  iio: bu27034: Ensure reset is written
  iio: dac: build ad5758 driver when AD5758 is selected
  iio: addac: ad74413: fix resistance input processing
  iio: light: vcnl4035: fixed chip ID check
  ...

Merge tag 'driver-core-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
"Here are two small driver core cacheinfo fixes for 6.4-rc5 that
  resolve a number of reported issues with that file. These changes have
  been in linux-next this past week with no reported problems"

* tag 'driver-core-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  drivers: base: cacheinfo: Update cpu_map_populated during CPU Hotplug
  drivers: base: cacheinfo: Fix shared_cpu_map changes in event of CPU hotplug

Merge tag 'tty-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver fixes from Greg KH:
"Here are some small tty/serial driver fixes for 6.4-rc5 that have all
  been in linux-next this past week with no reported problems. Included
  in here are:

   - 8250_tegra driver bugfix

   - fsl uart driver bugfixes

   - Kconfig fix for dependancy issue

   - dt-bindings fix for the 8250_omap driver"

* tag 'tty-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  dt-bindings: serial: 8250_omap: add rs485-rts-active-high
  serial: cpm_uart: Fix a COMPILE_TEST dependency
  soc: fsl: cpm1: Fix TSA and QMC dependencies in case of COMPILE_TEST
  tty: serial: fsl_lpuart: use UARTCTRL_TXINV to send break instead of UARTCTRL_SBK
  serial: 8250_tegra: Fix an error handling path in tegra_uart_probe()

Merge tag 'usb-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are some USB driver and core fixes for 6.4-rc5. Most of these are
  tiny driver fixes, including:

   - udc driver bugfix

   - f_fs gadget driver bugfix

   - cdns3 driver bugfix

   - typec bugfixes

  But the "big" thing in here is a fix yet-again for how the USB buffers
  are handled from userspace when dealing with DMA issues. The changes
  were discussed a lot, and tested a lot, on the list, and acked by the
  relevant mm maintainers and have been in linux-next all this past week
  with no reported problems"

* tag 'usb-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: tps6598x: Fix broken polling mode after system suspend/resume
  mm: page_table_check: Ensure user pages are not slab pages
  mm: page_table_check: Make it dependent on EXCLUSIVE_SYSTEM_RAM
  usb: usbfs: Use consistent mmap functions
  usb: usbfs: Enforce page requirements for mmap
  dt-bindings: usb: snps,dwc3: Fix "snps,hsphy_interface" type
  usb: gadget: udc: fix NULL dereference in remove()
  usb: gadget: f_fs: Add unbind event before functionfs_unbind
  usb: cdns3: fix NCM gadget RX speed 20x slow than expection at iMX8QM

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
"ARM:

   - Address some fallout of the locking rework, this time affecting the
     way the vgic is configured

   - Fix an issue where the page table walker frees a subtree and then
     proceeds with walking what it has just freed...

   - Check that a given PA donated to the guest is actually memory (only
     affecting pKVM)

   - Correctly handle MTE CMOs by Set/Way

   - Fix the reported address of a watchpoint forwarded to userspace

   - Fix the freeing of the root of stage-2 page tables

   - Stop creating spurious PMU events to perform detection of the
     default PMU and use the existing PMU list instead

  x86:

   - Fix a memslot lookup bug in the NX recovery thread that could
     theoretically let userspace bypass the NX hugepage mitigation

   - Fix a s/BLOCKING/PENDING bug in SVM's vNMI support

   - Account exit stats for fastpath VM-Exits that never leave the super
     tight run-loop

   - Fix an out-of-bounds bug in the optimized APIC map code, and add a
     regression test for the race"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: selftests: Add test for race in kvm_recalculate_apic_map()
  KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-bounds
  KVM: x86: Account fastpath-only VM-Exits in vCPU stats
  KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK
  KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker
  KVM: arm64: Document default vPMU behavior on heterogeneous systems
  KVM: arm64: Iterate arm_pmus list to probe for default PMU
  KVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed()
  KVM: arm64: Populate fault info for watchpoint
  KVM: arm64: Reload PTE after invoking walker callback on preorder traversal
  KVM: arm64: Handle trap of tagged Set/Way CMOs
  arm64: Add missing Set/Way CMO encodings
  KVM: arm64: Prevent unconditional donation of unmapped regions from the host
  KVM: arm64: vgic: Fix a comment
  KVM: arm64: vgic: Fix locking comment
  KVM: arm64: vgic: Wrap vgic_its_create() with config_lock
  KVM: arm64: vgic: Fix a circular locking issue

Merge tag 'powerpc-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

- Fix link errors in new aes-gcm-p10 code when built-in with other
   drivers

- Limit number of TCEs passed to H_STUFF_TCE hcall as per spec

- Use KSYM_NAME_LEN in xmon array size to avoid possible OOB write

Thanks to Gaurav Batra and Maninder Singh Vishal Chourasia.

* tag 'powerpc-6.4-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/xmon: Use KSYM_NAME_LEN in array size
  powerpc/iommu: Limit number of TCEs to 512 for H_STUFF_TCE hcall
  powerpc/crypto: Fix aes-gcm-p10 link errors

net/smc: Avoid to access invalid RMBs' MRs in SMCRv1 ADD LINK CONT

SMCRv1 has a similar issue to SMCRv2 (see link below) that may access
invalid MRs of RMBs when construct LLC ADD LINK CONT messages.

BUG: kernel NULL pointer dereference, address: 0000000000000014
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 5 PID: 48 Comm: kworker/5:0 Kdump: loaded Tainted: G W   E      6.4.0-rc3+ #49
Workqueue: events smc_llc_add_link_work [smc]
RIP: 0010:smc_llc_add_link_cont+0x160/0x270 [smc]
RSP: 0018:ffffa737801d3d50 EFLAGS: 00010286
RAX: ffff964f82144000 RBX: ffffa737801d3dd8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff964f81370c30
RBP: ffffa737801d3dd4 R08: ffff964f81370000 R09: ffffa737801d3db0
R10: 0000000000000001 R11: 0000000000000060 R12: ffff964f82e70000
R13: ffff964f81370c38 R14: ffffa737801d3dd3 R15: 0000000000000001
FS:  0000000000000000(0000) GS:ffff9652bfd40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000014 CR3: 000000008fa20004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  <TASK>
  smc_llc_srv_rkey_exchange+0xa7/0x190 [smc]
  smc_llc_srv_add_link+0x3ae/0x5a0 [smc]
  smc_llc_add_link_work+0xb8/0x140 [smc]
  process_one_work+0x1e5/0x3f0
  worker_thread+0x4d/0x2f0
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xe5/0x120
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x2c/0x50
  </TASK>

When an alernate RNIC is available in system, SMC will try to add a new
link based on the RNIC for resilience. All the RMBs in use will be mapped
to the new link. Then the RMBs' MRs corresponding to the new link will
be filled into LLC messages. For SMCRv1, they are ADD LINK CONT messages.

However smc_llc_add_link_cont() may mistakenly access to unused RMBs which
haven't been mapped to the new link and have no valid MRs, thus causing a
crash. So this patch fixes it.

Fixes: 87f88cda2128 ("net/smc: rkey processing for a new link as SMC client")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Wen Gu <[email protected]>
Reviewed-by: Wenjia Zhang <[email protected]>
Reviewed-by: Tony Lu <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Fix gitignore for recently added usptream self tests

This resolves the issue that generated binary is showing up as an untracked git file after every build on the kernel.

Signed-off-by: Weihao Gao <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

Merge tag 'kvm-x86-fixes-6.4' of https://github.com/kvm-x86/linux into HEAD

KVM x86 fixes for 6.4

- Fix a memslot lookup bug in the NX recovery thread that could
   theoretically let userspace bypass the NX hugepage mitigation

- Fix a s/BLOCKING/PENDING bug in SVM's vNMI support

- Account exit stats for fastpath VM-Exits that never leave the super
   tight run-loop

- Fix an out-of-bounds bug in the optimized APIC map code, and add a
   regression test for the race.

Merge tag 'kvmarm-fixes-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.4, take #3

- Fix the reported address of a watchpoint forwarded to userspace

- Fix the freeing of the root of stage-2 page tables

- Stop creating spurious PMU events to perform detection of the
default PMU and use the existing PMU list instead.

Merge tag 'kvmarm-fixes-6.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.4, take #2

- Address some fallout of the locking rework, this time affecting
  the way the vgic is configured

- Fix an issue where the page table walker frees a subtree and
  then proceeds with walking what it has just freed...

- Check that a given PA donated to the gues is actually memory
  (only affecting pKVM)

- Correctly handle MTE CMOs by Set/Way

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
"Five fixes, all in drivers.

  The most extensive is the target change to fix the hang in the login
  code, which involves changing timers from per login to per connection"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: stex: Fix gcc 13 warnings
  scsi: qla2xxx: Fix NULL pointer dereference in target mode
  scsi: target: iscsi: Prevent login threads from racing between each other
  scsi: target: iscsi: Remove unused transport_timer
  scsi: target: iscsi: Fix hang in the iSCSI login code

Merge tag 'leds-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/johan/linux

Pull LED fix from Johan Hovold:
"Here's a fix for a regression in 6.4-rc1 which broke the backlight on
machines such as the Lenovo ThinkPad X13s"

Acked-by: Lee Jones <[email protected]>
Link: https://lore.kernel.org/lkml/[email protected]/
* tag 'leds-6.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/johan/linux:
leds: qcom-lpg: Fix PWM period limits

leds: qcom-lpg: Fix PWM period limits

The introduction of high resolution PWM support changed the order of the
operations in the calculation of min and max period. The result in both
divisions is in most cases a truncation to 0, which limits the period to
the range of [0, 0].

Both numerators (and denominators) are within 64 bits, so the whole
expression can be put directly into the div64_u64, instead of doing it
partially.

Fixes: b00d2ed37617 ("leds: rgb: leds-qcom-lpg: Add support for high resolution PWM")
Reviewed-by: Caleb Connolly <[email protected]>
Tested-by: Steev Klimaszewski <[email protected]>
Signed-off-by: Bjorn Andersson <[email protected]>
Acked-by: Lee Jones <[email protected]>
Tested-by: Johan Hovold <[email protected]>
Tested-by: Neil Armstrong <[email protected]> # on SM8550-QRD
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Johan Hovold <[email protected]>

Merge tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

- Return NULL if the trace_probe list on trace_probe_event is empty

- selftests/ftrace: Choose testing symbol name for filtering feature
   from sample data instead of fixed symbol

* tag 'probes-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  selftests/ftrace: Choose target function for filter test from samples
  tracing/probe: trace_probe_primary_from_call(): checked list_first_entry

net: phylink: actually fix ksettings_set() ethtool call

Raju Lakkaraju reported that the below commit caused a regression
with Lan743x drivers and a 2.5G SFP. Sadly, this is because the commit
was utterly wrong. Let's fix this properly by not moving the
linkmode_and(), but instead copying the link ksettings and then
modifying the advertising mask before passing the modified link
ksettings to phylib.

Fixes: df0acdc59b09 ("net: phylink: fix ksettings_set() ethtool call")
Signed-off-by: Russell King (Oracle) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

selftests/ftrace: Choose target function for filter test from samples

Since the event-filter-function.tc expects the 'exit_mmap()' directly
calls 'kmem_cache_free()', this is vulnerable to code modifications.

Choose the target function for the filter test from the sample
event data so that it can keep test running correctly even if the caller
function name will be changed.

Link: https://lore.kernel.org/linux-trace-kernel/167919441260.1922645.18355804179347364057.stgit@mhiramat.roam.corp.google.com/
Link: https://lore.kernel.org/all/CA+G9fYtF-XEKi9YNGgR=Kf==7iRb2FrmEC7qtwAeQbfyah-UhA@mail.gmail.com/
Reported-by: Linux Kernel Functional Testing <[email protected]>
Fixes: 7f09d639b8c4 ("tracing/selftests: Add test for event filtering on function name")
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
Acked-by: Steven Rostedt (Google) <[email protected]>

Merge branch 'net-ipv6-skip_notify_on_dev_down-fix'

Eric Dumazet says:

====================
net/ipv6: skip_notify_on_dev_down fix

While reviewing Matthieu Baerts recent patch [1], I found it copied/pasted
an existing bug around skip_notify_on_dev_down.

First patch is a stable candidate, and second one can simply land
in net tree.

https://lore.kernel.org/lkml/20230601-net-next-skip_print_link_becomes_ready-v1-1-c13e64c14095@tessares.net/
====================

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>

net/ipv6: convert skip_notify_on_dev_down sysctl to u8

Save a bit a space, and could help future sysctls to
use the same pattern.

Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Acked-by: Matthieu Baerts <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>

net/ipv6: fix bool/int mismatch for skip_notify_on_dev_down

skip_notify_on_dev_down ctl table expects this field
to be an int (4 bytes), not a bool (1 byte).

Because proc_dou8vec_minmax() was added in 5.13,
this patch converts skip_notify_on_dev_down to an int.

Following patch then converts the field to u8 and use proc_dou8vec_minmax().

Fixes: 7c6bb7d2faaf ("net/ipv6: Add knob to skip DELROUTE message on device down")
Signed-off-by: Eric Dumazet <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Acked-by: Matthieu Baerts <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>

KVM: selftests: Add test for race in kvm_recalculate_apic_map()

Keep switching between LAPIC_MODE_X2APIC and LAPIC_MODE_DISABLED during
APIC map construction to hunt for TOCTOU bugs in KVM. KVM's optimized map
recalc makes multiple passes over the list of vCPUs, and the calculations
ignore vCPU's whose APIC is hardware-disabled, i.e. there's a window where
toggling LAPIC_MODE_DISABLED is quite interesting.

Signed-off-by: Michal Luczaj <[email protected]>
Co-developed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-bounds

Bail from kvm_recalculate_phys_map() and disable the optimized map if the
target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was added
and/or enabled its local APIC after the map was allocated.  This fixes an
out-of-bounds access bug in the !x2apic_format path where KVM would write
beyond the end of phys_map.

Check the x2APIC ID regardless of whether or not x2APIC is enabled,
as KVM's hardcodes x2APIC ID to be the vCPU ID, i.e. it can't change, and
the map allocation in kvm_recalculate_apic_map() doesn't check for x2APIC
being enabled, i.e. the check won't get false postivies.

Note, this also affects the x2apic_format path, which previously just
ignored the "x2apic_id > new->max_apic_id" case.  That too is arguably a
bug fix, as ignoring the vCPU meant that KVM would not send interrupts to
the vCPU until the next map recalculation.  In practice, that "bug" is
likely benign as a newly present vCPU/APIC would immediately trigger a
recalc.  But, there's no functional downside to disabling the map, and
a future patch will gracefully handle the -E2BIG case by retrying instead
of simply disabling the optimized map.

Opportunistically add a sanity check on the xAPIC ID size, along with a
comment explaining why the xAPIC ID is guaranteed to be "good".

Reported-by: Michal Luczaj <[email protected]>
Fixes: 5b84b0291702 ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs")
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

Merge branch 'Fix elem_size not being set for inner maps'

Rhys Rustad-Elliott says:

====================

Commit d937bc3449fa ("bpf: make uniform use of array->elem_size
everywhere in arraymap.c") changed array_map_gen_lookup to use
array->elem_size instead of round_up(map->value_size, 8) as the element
size when generating code to access a value in an array map.

array->elem_size, however, is not set by bpf_map_meta_alloc when
initializing an BPF_MAP_TYPE_ARRAY_OF_MAPS or BPF_MAP_TYPE_HASH_OF_MAPS.
This results in array_map_gen_lookup incorrectly outputting code that
always accesses index 0 in the array (as the index will be calculated
via a multiplication with the element size, which is incorrectly set to
0).

This patchset sets elem_size on the bpf_array object when allocating an
array or hash of maps to fix this and adds a selftest that accesses an
array map nested within a hash of maps at a nonzero index to prevent
regressions.

v1: https://lore.kernel.org/bpf/95b5da7c-ee52-3ecb-0a4e-f6a7a114f269@linux.dev/

Changelog:

v1 -> v2:

Address comments by Martin KaFai Lau:
- Directly use inner_array->elem_size instead of using round_up
- Move selftests to a new patch
- Use ASSERT_* macros instead of CHECK and remove duration
- Remove unnecessary usleep
- Shorten selftest name
====================

Signed-off-by: Martin KaFai Lau <[email protected]>

selftests/bpf: Add access_inner_map selftest

Add a selftest that accesses a BPF_MAP_TYPE_ARRAY (at a nonzero index)
nested within a BPF_MAP_TYPE_HASH_OF_MAPS to flex a previously buggy
case.

Signed-off-by: Rhys Rustad-Elliott <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

KVM: x86: Account fastpath-only VM-Exits in vCPU stats

Increment vcpu->stat.exits when handling a fastpath VM-Exit without
going through any part of the "slow" path. Not bumping the exits stat
can result in wildly misleading exit counts, e.g. if the primary reason
the guest is exiting is to program the TSC deadline timer.

Fixes: 404d5d7bff0d ("KVM: X86: Introduce more exit_fastpath_completion enum values")
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK

While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware
I noticed that with vCPU count large enough (> 16) they sometimes froze at
boot.
With vCPU count of 64 they never booted successfully - suggesting some kind
of a race condition.

Since adding "vnmi=0" module parameter made these guests boot successfully
it was clear that the problem is most likely (v)NMI-related.

Running kvm-unit-tests quickly showed failing NMI-related tests cases, like
"multiple nmi" and "pending nmi" from apic-split, x2apic and xapic tests
and the NMI parts of eventinj test.

The issue was that once one NMI was being serviced no other NMI was allowed
to be set pending (NMI limit = 0), which was traced to
svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather
than for the "NMI pending" flag.

Fix this by testing for the right flag in svm_is_vnmi_pending().
Once this is done, the NMI-related kvm-unit-tests pass successfully and
the Windows guest no longer freezes at boot.

Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
Signed-off-by: Maciej S. Szmigiero <[email protected]>
Reviewed-by: Sean Christopherson <[email protected]>
Link: https://lore.kernel.org/r/be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.com
Signed-off-by: Sean Christopherson <[email protected]>

KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker

Factor in the address space (non-SMM vs. SMM) of the target shadow page
when recovering potential NX huge pages, otherwise KVM will retrieve the
wrong memslot when zapping shadow pages that were created for SMM.  The
bug most visibly manifests as a WARN on the memslot being non-NULL, but
the worst case scenario is that KVM could unaccount the shadow page
without ensuring KVM won't install a huge page, i.e. if the non-SMM slot
is being dirty logged, but the SMM slot is not.

------------[ cut here ]------------
WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015
kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re
RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450
R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98
R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003
FS:  0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0
Call Trace:
  <TASK>
__pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm]
  kvm_vm_worker_thread+0x106/0x1c0 [kvm]
  kthread+0xd9/0x100
  ret_from_fork+0x2c/0x50
  </TASK>
---[ end trace 0000000000000000 ]---

This bug was exposed by commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU
roots until they are explicitly invalidated"), which allowed KVM to retain
SMM TDP MMU roots effectively indefinitely.  Before commit edbdb43fc96b,
KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages
once all vCPUs exited SMM, which made the window where this bug (recovering
an SMM NX huge page) could be encountered quite tiny.  To hit the bug, the
NX recovery thread would have to run while at least one vCPU was in SMM.
Most VMs typically only use SMM during boot, and so the problematic shadow
pages were gone by the time the NX recovery thread ran.

Now that KVM preserves TDP MMU roots until they are explicitly invalidated
(e.g. by a memslot deletion), the window to trigger the bug is effectively
never closed because most VMMs don't delete memslots after boot (except
for a handful of special scenarios).

Fixes: eb298605705a ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages")
Reported-by: Fabio Coatti <[email protected]>
Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sean Christopherson <[email protected]>

bpf: Fix elem_size not being set for inner maps

Commit d937bc3449fa ("bpf: make uniform use of array->elem_size
everywhere in arraymap.c") changed array_map_gen_lookup to use
array->elem_size instead of round_up(map->value_size, 8) as the element
size when generating code to access a value in an array map.

array->elem_size, however, is not set by bpf_map_meta_alloc when
initializing an BPF_MAP_TYPE_ARRAY_OF_MAPS or BPF_MAP_TYPE_HASH_OF_MAPS.
This results in array_map_gen_lookup incorrectly outputting code that
always accesses index 0 in the array (as the index will be calculated
via a multiplication with the element size, which is incorrectly set to
0).

Set elem_size on the bpf_array object when allocating an array or hash
of maps to fix this.

Fixes: d937bc3449fa ("bpf: make uniform use of array->elem_size everywhere in arraymap.c")
Signed-off-by: Rhys Rustad-Elliott <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Martin KaFai Lau <[email protected]>

tpm, tpm_tis: correct tpm_tis_flags enumeration values

With commit 858e8b792d06 ("tpm, tpm_tis: Avoid cache incoherency in test
for interrupts") bit accessor functions are used to access flags in
tpm_tis_data->flags.

However these functions expect bit numbers, while the flags are defined
as bit masks in enum tpm_tis_flag.

Fix this inconsistency by using numbers instead of masks also for the
flags in the enum.

Reported-by: Pavel Machek <[email protected]>
Fixes: 858e8b792d06 ("tpm, tpm_tis: Avoid cache incoherency in test for interrupts")
Signed-off-by: Lino Sanfilippo <[email protected]>
Cc: [email protected]
Reviewed-by: Pavel Machek <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fix from Ted Ts'o:
"Fix an ext4 regression which landed during the 6.4 merge window"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"

Merge tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
"One regression fix.

  The rewrite of scrub code in 6.4 broke device replace in zoned mode,
  some of the writes could happen out of order so this had to be
  adjusted for all cases"

* tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: fix dev-replace after the scrub rework

Revert "ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits"

This reverts commit 32c0869370194ae5ac9f9f501953ef693040f6a1.

The reverted commit was intended to remove a dead check however it was observed
that this check was actually being used to exit early instead of looping
sbi->s_mb_max_to_scan times when we are able to find a free extent bigger than
the goal extent. Due to this, a my performance tests (fsmark, parallel file
writes in a highly fragmented FS) were seeing a 2x-3x regression.

Example, the default value of the following variables is:

sbi->s_mb_max_to_scan = 200
sbi->s_mb_min_to_scan = 10

In ext4_mb_check_limits() if we find an extent smaller than goal, then we return
early and try again. This loop will go on until we have processed
sbi->s_mb_max_to_scan(=200) number of free extents at which point we exit and
just use whatever we have even if it is smaller than goal extent.

Now, the regression comes when we find an extent bigger than goal. Earlier, in
this case we would loop only sbi->s_mb_min_to_scan(=10) times and then just use
the bigger extent. However with commit 32c08693 that check was removed and hence
we would loop sbi->s_mb_max_to_scan(=200) times even though we have a big enough
free extent to satisfy the request. The only time we would exit early would be
when the free extent is *exactly* the size of our goal, which is pretty uncommon
occurrence and so we would almost always end up looping 200 times.

Hence, revert the commit by adding the check back to fix the regression. Also
add a comment to outline this policy.

Fixes: 32c086937019 ("ext4: remove ac->ac_found > sbi->s_mb_min_to_scan dead check in ext4_mb_check_limits")
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Reviewed-by: Kemeng Shi <[email protected]>
Link: https://lore.kernel.org/r/ddcae9658e46880dfec2fb0aa61d01fb3353d202.1685449706.git.ojaswin@linux.ibm.com
Signed-off-by: Theodore Ts'o <[email protected]>