]> Git Repo - linux.git/log
linux.git
4 months agoMerge tag 'v6.12-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Fri, 15 Nov 2024 18:04:39 +0000 (10:04 -0800)]
Merge tag 'v6.12-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
 "Fix a regression in the MIPS CRC32C code"

* tag 'v6.12-p5' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: mips/crc32 - fix the CRC32C implementation

4 months agoMerge tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Fri, 15 Nov 2024 17:59:51 +0000 (09:59 -0800)]
Merge tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fix from Tejun Heo:
 "One more fix for v6.12-rc7

  ops.cpu_acquire() was being invoked with the wrong kfunc mask allowing
  the operation to call kfuncs which shouldn't be allowed. Fix it by
  using SCX_KF_REST instead, which is trivial and low risk"

* tag 'sched_ext-for-6.12-rc7-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST

4 months agoMerge tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Fri, 15 Nov 2024 17:45:32 +0000 (09:45 -0800)]
Merge tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "One more fix that seems urgent and good to have in 6.12 final.

  It could potentially lead to unexpected transaction aborts, due to
  wrong comparison and order of processing of delayed refs"

* tag 'for-6.12-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix incorrect comparison for delayed refs

4 months agosched_ext: ops.cpu_acquire() should be called with SCX_KF_REST
Tejun Heo [Thu, 14 Nov 2024 18:50:58 +0000 (08:50 -1000)]
sched_ext: ops.cpu_acquire() should be called with SCX_KF_REST

ops.cpu_acquire() is currently called with 0 kf_maks which is interpreted as
SCX_KF_UNLOCKED which allows all unlocked kfuncs, but ops.cpu_acquire() is
called from balance_one() under the rq lock and should only be allowed call
kfuncs that are safe under the rq lock. Update it to use SCX_KF_REST.

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Vernet <[email protected]>
Cc: Zhao Mengmeng <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Fixes: 245254f7081d ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
4 months agoMerge tag 'net-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 14 Nov 2024 18:05:33 +0000 (10:05 -0800)]
Merge tag 'net-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bluetooth.

  Quite calm week. No new regression under investigation.

  Current release - regressions:

   - eth: revert "igb: Disable threaded IRQ for igb_msix_other"

  Current release - new code bugs:

   - bluetooth: btintel: direct exception event to bluetooth stack

  Previous releases - regressions:

   - core: fix data-races around sk->sk_forward_alloc

   - netlink: terminate outstanding dump on socket close

   - mptcp: error out earlier on disconnect

   - vsock: fix accept_queue memory leak

   - phylink: ensure PHY momentary link-fails are handled

   - eth: mlx5:
      - fix null-ptr-deref in add rule err flow
      - lock FTE when checking if active

   - eth: dwmac-mediatek: fix inverted handling of mediatek,mac-wol

  Previous releases - always broken:

   - sched: fix u32's systematic failure to free IDR entries for hnodes.

   - sctp: fix possible UAF in sctp_v6_available()

   - eth: bonding: add ns target multicast address to slave device

   - eth: mlx5: fix msix vectors to respect platform limit

   - eth: icssg-prueth: fix 1 PPS sync"

* tag 'net-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
  net: sched: u32: Add test case for systematic hnode IDR leaks
  selftests: bonding: add ns multicast group testing
  bonding: add ns target multicast address to slave device
  net: ti: icssg-prueth: Fix 1 PPS sync
  stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines
  net: Make copy_safe_from_sockptr() match documentation
  net: stmmac: dwmac-mediatek: Fix inverted handling of mediatek,mac-wol
  ipmr: Fix access to mfc_cache_list without lock held
  samples: pktgen: correct dev to DEV
  net: phylink: ensure PHY momentary link-fails are handled
  mptcp: pm: use _rcu variant under rcu_read_lock
  mptcp: hold pm lock when deleting entry
  mptcp: update local address flags when setting it
  net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes.
  MAINTAINERS: Re-add cancelled Renesas driver sections
  Revert "igb: Disable threaded IRQ for igb_msix_other"
  Bluetooth: btintel: Direct exception event to bluetooth stack
  Bluetooth: hci_core: Fix calling mgmt_device_connected
  virtio/vsock: Improve MSG_ZEROCOPY error handling
  vsock: Fix sk_error_queue memory leak
  ...

4 months agoMerge tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs
Linus Torvalds [Thu, 14 Nov 2024 18:00:23 +0000 (10:00 -0800)]
Merge tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "This fixes one minor regression from the btree cache fixes (in the
  scan_for_btree_nodes repair path) - and the shutdown path fix is the
  big one here, in terms of bugs closed:

   - Assorted tiny syzbot fixes

   - Shutdown path fix: "bch2_btree_write_buffer_flush_going_ro()"

     The shutdown path wasn't flushing the btree write buffer, leading
     to shutting down while we still had operations in flight. This
     fixes a whole slew of syzbot bugs, and undoubtedly other strange
     heisenbugs.

* tag 'bcachefs-2024-11-13' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix assertion pop in bch2_ptr_swab()
  bcachefs: Fix journal_entry_dev_usage_to_text() overrun
  bcachefs: Allow for unknown key types in backpointers fsck
  bcachefs: Fix assertion pop in topology repair
  bcachefs: Fix hidden btree errors when reading roots
  bcachefs: Fix validate_bset() repair path
  bcachefs: Fix missing validation for bch_backpointer.level
  bcachefs: Fix bch_member.btree_bitmap_shift validation
  bcachefs: bch2_btree_write_buffer_flush_going_ro()

4 months agobtrfs: fix incorrect comparison for delayed refs
Josef Bacik [Wed, 13 Nov 2024 16:05:13 +0000 (11:05 -0500)]
btrfs: fix incorrect comparison for delayed refs

When I reworked delayed ref comparison in cf4f04325b2b ("btrfs: move
->parent and ->ref_root into btrfs_delayed_ref_node"), I made a mistake
and returned -1 for the case where ref1->ref_root was > than
ref2->ref_root.  This is a subtle bug that can result in improper
delayed ref running order, which can result in transaction aborts.

Fixes: cf4f04325b2b ("btrfs: move ->parent and ->ref_root into btrfs_delayed_ref_node")
CC: [email protected] # 6.10+
Reviewed-by: Filipe Manana <[email protected]>
Reviewed-by: Qu Wenruo <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
4 months agonet: sched: u32: Add test case for systematic hnode IDR leaks
Alexandre Ferrieux [Wed, 13 Nov 2024 10:04:28 +0000 (11:04 +0100)]
net: sched: u32: Add test case for systematic hnode IDR leaks

Add a tdc test case to exercise the just-fixed systematic leak of
IDR entries in u32 hnode disposal. Given the IDR in question is
confined to the range [1..0x7FF], it is sufficient to create/delete
the same filter 2048 times to fill it up and get a nonzero exit
status from "tc filter add".

Signed-off-by: Alexandre Ferrieux <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Reviewed-by: Victor Nogueira <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
4 months agoMerge branch 'bonding-fix-ns-targets-not-work-on-hardware-nic'
Paolo Abeni [Thu, 14 Nov 2024 10:16:30 +0000 (11:16 +0100)]
Merge branch 'bonding-fix-ns-targets-not-work-on-hardware-nic'

Hangbin Liu says:

====================
bonding: fix ns targets not work on hardware NIC

The first patch fixed ns targets not work on hardware NIC when bonding
set arp_validate.

The second patch add a related selftest for bonding.

v4: Thanks Nikolay for the comments:
    use bond_slave_ns_maddrs_{add/del} with clear name
    fix comments typos
    remove _slave_set_ns_maddrs underscore directly
    update bond_option_arp_validate_set() change logic
v3: use ndisc_mc_map to convert the mcast mac address (Jay Vosburgh)
v2: only add/del mcast group on backup slaves when arp_validate is set (Jay Vosburgh)
    arp_validate doesn't support 3ad, tlb, alb. So let's only do it on ab mode.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
4 months agoselftests: bonding: add ns multicast group testing
Hangbin Liu [Mon, 11 Nov 2024 10:16:50 +0000 (10:16 +0000)]
selftests: bonding: add ns multicast group testing

Add a test to make sure the backup slaves join correct multicast group
when arp_validate enabled and ns_ip6_target is set. Here is the result:

TEST: arp_validate (active-backup ns_ip6_target arp_validate 0)     [ OK ]
TEST: arp_validate (join mcast group)                               [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 1)     [ OK ]
TEST: arp_validate (join mcast group)                               [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 2)     [ OK ]
TEST: arp_validate (join mcast group)                               [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 3)     [ OK ]
TEST: arp_validate (join mcast group)                               [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 4)     [ OK ]
TEST: arp_validate (join mcast group)                               [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 5)     [ OK ]
TEST: arp_validate (join mcast group)                               [ OK ]
TEST: arp_validate (active-backup ns_ip6_target arp_validate 6)     [ OK ]
TEST: arp_validate (join mcast group)                               [ OK ]

Signed-off-by: Hangbin Liu <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
4 months agobonding: add ns target multicast address to slave device
Hangbin Liu [Mon, 11 Nov 2024 10:16:49 +0000 (10:16 +0000)]
bonding: add ns target multicast address to slave device

Commit 4598380f9c54 ("bonding: fix ns validation on backup slaves")
tried to resolve the issue where backup slaves couldn't be brought up when
receiving IPv6 Neighbor Solicitation (NS) messages. However, this fix only
worked for drivers that receive all multicast messages, such as the veth
interface.

For standard drivers, the NS multicast message is silently dropped because
the slave device is not a member of the NS target multicast group.

To address this, we need to make the slave device join the NS target
multicast group, ensuring it can receive these IPv6 NS messages to validate
the slave’s status properly.

There are three policies before joining the multicast group:
1. All settings must be under active-backup mode (alb and tlb do not support
   arp_validate), with backup slaves and slaves supporting multicast.
2. We can add or remove multicast groups when arp_validate changes.
3. Other operations, such as enslaving, releasing, or setting NS targets,
   need to be guarded by arp_validate.

Fixes: 4e24be018eb9 ("bonding: add new parameter ns_targets")
Signed-off-by: Hangbin Liu <[email protected]>
Reviewed-by: Nikolay Aleksandrov <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
4 months agonet: ti: icssg-prueth: Fix 1 PPS sync
Meghana Malladi [Mon, 11 Nov 2024 09:58:42 +0000 (15:28 +0530)]
net: ti: icssg-prueth: Fix 1 PPS sync

The first PPS latch time needs to be calculated by the driver
(in rounded off seconds) and configured as the start time
offset for the cycle. After synchronizing two PTP clocks
running as master/slave, missing this would cause master
and slave to start immediately with some milliseconds
drift which causes the PPS signal to never synchronize with
the PTP master.

Fixes: 186734c15886 ("net: ti: icssg-prueth: add packet timestamping and ptp support")
Signed-off-by: Meghana Malladi <[email protected]>
Reviewed-by: Vadim Fedorenko <[email protected]>
Reviewed-by: MD Danish Anwar <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
4 months agostmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines
Vitalii Mordan [Fri, 8 Nov 2024 17:33:34 +0000 (20:33 +0300)]
stmmac: dwmac-intel-plat: fix call balance of tx_clk handling routines

If the clock dwmac->tx_clk was not enabled in intel_eth_plat_probe,
it should not be disabled in any path.

Conversely, if it was enabled in intel_eth_plat_probe, it must be disabled
in all error paths to ensure proper cleanup.

Found by Linux Verification Center (linuxtesting.org) with Klever.

Fixes: 9efc9b2b04c7 ("net: stmmac: Add dwmac-intel-plat for GBE driver")
Signed-off-by: Vitalii Mordan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet: Make copy_safe_from_sockptr() match documentation
Michal Luczaj [Sun, 10 Nov 2024 23:17:34 +0000 (00:17 +0100)]
net: Make copy_safe_from_sockptr() match documentation

copy_safe_from_sockptr()
  return copy_from_sockptr()
    return copy_from_sockptr_offset()
      return copy_from_user()

copy_from_user() does not return an error on fault. Instead, it returns a
number of bytes that were not copied. Have it handled.

Patch has a side effect: it un-breaks garbage input handling of
nfc_llcp_setsockopt() and mISDN's data_sock_setsockopt().

Fixes: 6309863b31dd ("net: add copy_safe_from_sockptr() helper")
Signed-off-by: Michal Luczaj <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet: stmmac: dwmac-mediatek: Fix inverted handling of mediatek,mac-wol
Nícolas F. R. A. Prado [Sat, 9 Nov 2024 15:16:32 +0000 (10:16 -0500)]
net: stmmac: dwmac-mediatek: Fix inverted handling of mediatek,mac-wol

The mediatek,mac-wol property is being handled backwards to what is
described in the binding: it currently enables PHY WOL when the property
is present and vice versa. Invert the driver logic so it matches the
binding description.

Fixes: fd1d62d80ebc ("net: stmmac: replace the use_phy_wol field with a flag")
Signed-off-by: Nícolas F. R. A. Prado <[email protected]>
Link: https://patch.msgid.link/20241109-mediatek-mac-wol-noninverted-v2-1-0e264e213878@collabora.com
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoipmr: Fix access to mfc_cache_list without lock held
Breno Leitao [Fri, 8 Nov 2024 14:08:36 +0000 (06:08 -0800)]
ipmr: Fix access to mfc_cache_list without lock held

Accessing `mr_table->mfc_cache_list` is protected by an RCU lock. In the
following code flow, the RCU read lock is not held, causing the
following error when `RCU_PROVE` is not held. The same problem might
show up in the IPv6 code path.

6.12.0-rc5-kbuilder-01145-gbac17284bdcb #33 Tainted: G            E    N
-----------------------------
net/ipv4/ipmr_base.c:313 RCU-list traversed in non-reader section!!

rcu_scheduler_active = 2, debug_locks = 1
   2 locks held by RetransmitAggre/3519:
    #0: ffff88816188c6c0 (nlk_cb_mutex-ROUTE){+.+.}-{3:3}, at: __netlink_dump_start+0x8a/0x290
    #1: ffffffff83fcf7a8 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_dumpit+0x6b/0x90

stack backtrace:
    lockdep_rcu_suspicious
    mr_table_dump
    ipmr_rtm_dumproute
    rtnl_dump_all
    rtnl_dumpit
    netlink_dump
    __netlink_dump_start
    rtnetlink_rcv_msg
    netlink_rcv_skb
    netlink_unicast
    netlink_sendmsg

This is not a problem per see, since the RTNL lock is held here, so, it
is safe to iterate in the list without the RCU read lock, as suggested
by Eric.

To alleviate the concern, modify the code to use
list_for_each_entry_rcu() with the RTNL-held argument.

The annotation will raise an error only if RTNL or RCU read lock are
missing during iteration, signaling a legitimate problem, otherwise it
will avoid this false positive.

This will solve the IPv6 case as well, since ip6mr_rtm_dumproute() calls
this function as well.

Signed-off-by: Breno Leitao <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agosamples: pktgen: correct dev to DEV
Wei Fang [Tue, 12 Nov 2024 03:03:47 +0000 (11:03 +0800)]
samples: pktgen: correct dev to DEV

In the pktgen_sample01_simple.sh script, the device variable is uppercase
'DEV' instead of lowercase 'dev'. Because of this typo, the script cannot
enable UDP tx checksum.

Fixes: 460a9aa23de6 ("samples: pktgen: add UDP tx checksum support")
Signed-off-by: Wei Fang <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Jesper Dangaard Brouer <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet: phylink: ensure PHY momentary link-fails are handled
Russell King (Oracle) [Tue, 12 Nov 2024 16:20:00 +0000 (16:20 +0000)]
net: phylink: ensure PHY momentary link-fails are handled

Normally, phylib won't notify changes in quick succession. However, as
a result of commit 3e43b903da04 ("net: phy: Immediately call
adjust_link if only tx_lpi_enabled changes") this is no longer true -
it is now possible that phy_link_down() and phy_link_up() will both
complete before phylink's resolver has run, which means it'll miss that
pl->phy_state.link momentarily became false.

Rename "mac_link_dropped" to be more generic "link_failed" since it will
cover more than the MAC/PCS end of the link failing, and arrange to set
this in phylink_phy_change() if we notice that the PHY reports that the
link is down.

This will ensure that we capture an EEE reconfiguration event.

Fixes: 3e43b903da04 ("net: phy: Immediately call adjust_link if only tx_lpi_enabled changes")
Signed-off-by: Russell King (Oracle) <[email protected]>
Reviewed-by: Oleksij Rempel <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMerge branch 'mptcp-pm-a-few-more-fixes'
Jakub Kicinski [Thu, 14 Nov 2024 02:51:09 +0000 (18:51 -0800)]
Merge branch 'mptcp-pm-a-few-more-fixes'

Matthieu Baerts says:

====================
mptcp: pm: a few more fixes

Three small fixes related to the MPTCP path-manager:

- Patch 1: correctly reflect the backup flag to the corresponding local
  address entry of the userspace path-manager. A fix for v5.19.

- Patch 2: hold the PM lock when deleting an entry from the local
  addresses of the userspace path-manager to avoid messing up with this
  list. A fix for v5.19.

- Patch 3: use _rcu variant to iterate the in-kernel path-manager's
  local addresses list, when under rcu_read_lock(). A fix for v5.17.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agomptcp: pm: use _rcu variant under rcu_read_lock
Matthieu Baerts (NGI0) [Tue, 12 Nov 2024 19:18:35 +0000 (20:18 +0100)]
mptcp: pm: use _rcu variant under rcu_read_lock

In mptcp_pm_create_subflow_or_signal_addr(), rcu_read_(un)lock() are
used as expected to iterate over the list of local addresses, but
list_for_each_entry() was used instead of list_for_each_entry_rcu() in
__lookup_addr(). It is important to use this variant which adds the
required READ_ONCE() (and diagnostic checks if enabled).

Because __lookup_addr() is also used in mptcp_pm_nl_set_flags() where it
is called under the pernet->lock and not rcu_read_lock(), an extra
condition is then passed to help the diagnostic checks making sure
either the associated spin lock or the RCU lock is held.

Fixes: 86e39e04482b ("mptcp: keep track of local endpoint still available for each msk")
Cc: [email protected]
Reviewed-by: Geliang Tang <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agomptcp: hold pm lock when deleting entry
Geliang Tang [Tue, 12 Nov 2024 19:18:34 +0000 (20:18 +0100)]
mptcp: hold pm lock when deleting entry

When traversing userspace_pm_local_addr_list and deleting an entry from
it in mptcp_pm_nl_remove_doit(), msk->pm.lock should be held.

This patch holds this lock before mptcp_userspace_pm_lookup_addr_by_id()
and releases it after list_move() in mptcp_pm_nl_remove_doit().

Fixes: d9a4594edabf ("mptcp: netlink: Add MPTCP_PM_CMD_REMOVE")
Cc: [email protected]
Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agomptcp: update local address flags when setting it
Geliang Tang [Tue, 12 Nov 2024 19:18:33 +0000 (20:18 +0100)]
mptcp: update local address flags when setting it

Just like in-kernel pm, when userspace pm does set_flags, it needs to send
out MP_PRIO signal, and also modify the flags of the corresponding address
entry in the local address list. This patch implements the missing logic.

Traverse all address entries on userspace_pm_local_addr_list to find the
local address entry, if bkup is true, set the flags of this entry with
FLAG_BACKUP, otherwise, clear FLAG_BACKUP.

Fixes: 892f396c8e68 ("mptcp: netlink: issue MP_PRIO signals from userspace PMs")
Cc: [email protected]
Signed-off-by: Geliang Tang <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMerge tag 'pm-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Linus Torvalds [Wed, 13 Nov 2024 21:32:51 +0000 (13:32 -0800)]
Merge tag 'pm-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fix from Rafael Wysocki:
 "Fix a locking issue in the asymmetric CPU capacity setup code in the
  intel_pstate driver that may lead to a deadlock if CPU online/offline
  runs in parallel with the code in question, which is unlikely but not
  impossible (Rafael Wysocki)"

* tag 'pm-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Rearrange locking in hybrid_init_cpu_capacity_scaling()

4 months agoMerge tag 'tpmdd-next-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 13 Nov 2024 21:28:58 +0000 (13:28 -0800)]
Merge tag 'tpmdd-next-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd

Pull tpm fixes from Jarkko Sakkinen:
 "Two bug fixes for TPM bus encryption (the remaining reported issues in
  the feature)"

* tag 'tpmdd-next-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
  tpm: Disable TPM on tpm2_create_primary() failure
  tpm: Opt-in in disable PCR integrity protection

4 months agotpm: Disable TPM on tpm2_create_primary() failure
Jarkko Sakkinen [Wed, 13 Nov 2024 18:35:39 +0000 (20:35 +0200)]
tpm: Disable TPM on tpm2_create_primary() failure

The earlier bug fix misplaced the error-label when dealing with the
tpm2_create_primary() return value, which the original completely ignored.

Cc: [email protected]
Reported-by: Christoph Anton Mitterer <[email protected]>
Closes: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1087331
Fixes: cc7d8594342a ("tpm: Rollback tpm2_load_null()")
Signed-off-by: Jarkko Sakkinen <[email protected]>
4 months agotpm: Opt-in in disable PCR integrity protection
Jarkko Sakkinen [Wed, 13 Nov 2024 05:54:12 +0000 (07:54 +0200)]
tpm: Opt-in in disable PCR integrity protection

The initial HMAC session feature added TPM bus encryption and/or integrity
protection to various in-kernel TPM operations. This can cause performance
bottlenecks with IMA, as it heavily utilizes PCR extend operations.

In order to mitigate this performance issue, introduce a kernel
command-line parameter to the TPM driver for disabling the integrity
protection for PCR extend operations (i.e. TPM2_PCR_Extend).

Cc: James Bottomley <[email protected]>
Link: https://lore.kernel.org/linux-integrity/[email protected]/
Fixes: 6519fea6fd37 ("tpm: add hmac checks to tpm2_pcr_extend()")
Tested-by: Mimi Zohar <[email protected]>
Co-developed-by: Roberto Sassu <[email protected]>
Signed-off-by: Roberto Sassu <[email protected]>
Co-developed-by: Mimi Zohar <[email protected]>
Signed-off-by: Mimi Zohar <[email protected]>
Signed-off-by: Jarkko Sakkinen <[email protected]>
4 months agoMerge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Linus Torvalds [Wed, 13 Nov 2024 17:14:19 +0000 (09:14 -0800)]
Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann:

 - Fix a mismatching RCU unlock flavor in bpf_out_neigh_v6 (Jiawei Ye)

 - Fix BPF sockmap with kTLS to reject vsock and unix sockets upon kTLS
   context retrieval (Zijian Zhang)

 - Fix BPF bits iterator selftest for s390x (Hou Tao)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix mismatched RCU unlock flavour in bpf_out_neigh_v6
  bpf: Add sk_is_inet and IS_ICSK check in tls_sw_has_ctx_tx/rx
  selftests/bpf: Use -4095 as the bad address for bits iterator

4 months agoMerge tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Wed, 13 Nov 2024 17:09:00 +0000 (09:09 -0800)]
Merge tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:

 - fix possible CPUs setup logical-physical CPU mapping, in order to
   avoid CPU hotplug issue

 - fix some KASAN bugs

 - fix AP booting issue in VM mode

 - some trivial cleanups

* tag 'loongarch-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: Fix AP booting issue in VM mode
  LoongArch: Add WriteCombine shadow mapping in KASAN
  LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits
  LoongArch: Make KASAN work with 5-level page-tables
  LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS
  LoongArch: Fix early_numa_add_cpu() usage for FDT systems
  LoongArch: For all possible CPUs setup logical-physical CPU mapping

4 months agoMerge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Wed, 13 Nov 2024 16:58:11 +0000 (08:58 -0800)]
Merge tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "10 hotfixes, 7 of which are cc:stable. 7 are MM, 3 are not. All
  singletons"

* tag 'mm-hotfixes-stable-2024-11-12-16-39' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: swapfile: fix cluster reclaim work crash on rotational devices
  selftests: hugetlb_dio: fixup check for initial conditions to skip in the start
  mm/thp: fix deferred split queue not partially_mapped: fix
  mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
  nommu: pass NULL argument to vma_iter_prealloc()
  ocfs2: fix UBSAN warning in ocfs2_verify_volume()
  nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
  nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
  mm: page_alloc: move mlocked flag clearance into free_pages_prepare()
  mm: count zeromap read and set for swapout and swapin

4 months agonet: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes.
Alexandre Ferrieux [Sun, 10 Nov 2024 17:28:36 +0000 (18:28 +0100)]
net: sched: cls_u32: Fix u32's systematic failure to free IDR entries for hnodes.

To generate hnode handles (in gen_new_htid()), u32 uses IDR and
encodes the returned small integer into a structured 32-bit
word. Unfortunately, at disposal time, the needed decoding
is not done. As a result, idr_remove() fails, and the IDR
fills up. Since its size is 2048, the following script ends up
with "Filter already exists":

  tc filter add dev myve $FILTER1
  tc filter add dev myve $FILTER2
  for i in {1..2048}
  do
    echo $i
    tc filter del dev myve $FILTER2
    tc filter add dev myve $FILTER2
  done

This patch adds the missing decoding logic for handles that
deserve it.

Fixes: e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles")
Reviewed-by: Eric Dumazet <[email protected]>
Acked-by: Jamal Hadi Salim <[email protected]>
Signed-off-by: Alexandre Ferrieux <[email protected]>
Tested-by: Victor Nogueira <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMAINTAINERS: Re-add cancelled Renesas driver sections
Geert Uytterhoeven [Mon, 11 Nov 2024 10:03:21 +0000 (11:03 +0100)]
MAINTAINERS: Re-add cancelled Renesas driver sections

Removing full driver sections also removed mailing list entries, causing
submitters of future patches to forget CCing these mailing lists.

Hence re-add the sections for the Renesas Ethernet AVB, R-Car SATA, and
SuperH Ethernet drivers.  Add people who volunteered to maintain these
drivers (thanks a lot!), and mark all of them as supported.

Fixes: 6e90b675cf942e50 ("MAINTAINERS: Remove some entries due to various compliance requirements.")
Signed-off-by: Geert Uytterhoeven <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Simon Horman <[email protected]>
Acked-by: Niklas Cassel <[email protected]>
Acked-by: Niklas Söderlund <[email protected]>
Reviewed-by: Paul Barker <[email protected]>
Link: https://patch.msgid.link/4b2105332edca277f07ffa195796975e9ddce994.1731319098.git.geert+renesas@glider.be
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoRevert "igb: Disable threaded IRQ for igb_msix_other"
Wander Lairson Costa [Wed, 6 Nov 2024 11:14:26 +0000 (08:14 -0300)]
Revert "igb: Disable threaded IRQ for igb_msix_other"

This reverts commit 338c4d3902feb5be49bfda530a72c7ab860e2c9f.

Sebastian noticed the ISR indirectly acquires spin_locks, which are
sleeping locks under PREEMPT_RT, which leads to kernel splats.

Fixes: 338c4d3902feb ("igb: Disable threaded IRQ for igb_msix_other")
Reported-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Wander Lairson Costa <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
Acked-by: Przemek Kitszel <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMerge tag 'for-net-2024-11-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluet...
Jakub Kicinski [Wed, 13 Nov 2024 01:30:41 +0000 (17:30 -0800)]
Merge tag 'for-net-2024-11-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - btintel: Direct exception event to bluetooth stack
 - hci_core: Fix calling mgmt_device_connected

* tag 'for-net-2024-11-12' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: btintel: Direct exception event to bluetooth stack
  Bluetooth: hci_core: Fix calling mgmt_device_connected
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Wed, 13 Nov 2024 00:39:34 +0000 (16:39 -0800)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fix from Michael Tsirkin:
 "A last minute mlx5 bugfix"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa/mlx5: Fix PA offset with unaligned starting iotlb map

4 months agomm: swapfile: fix cluster reclaim work crash on rotational devices
Johannes Weiner [Thu, 7 Nov 2024 14:08:36 +0000 (09:08 -0500)]
mm: swapfile: fix cluster reclaim work crash on rotational devices

syzbot and Daan report a NULL pointer crash in the new full swap cluster
reclaim work:

> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 1 UID: 0 PID: 51 Comm: kworker/1:1 Not tainted 6.12.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
> Workqueue: events swap_reclaim_work
> RIP: 0010:__list_del_entry_valid_or_report+0x20/0x1c0 lib/list_debug.c:49
> Code: 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 48 89 fe 48 83 c7 08 48 83 ec 18 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 19 01 00 00 48 89 f2 48 8b 4e 08 48 b8 00 00 00
> RSP: 0018:ffffc90000bb7c30 EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff88807b9ae078
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000008
> RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000000
> R10: 0000000000000001 R11: 000000000000004f R12: dffffc0000000000
> R13: ffffffffffffffb8 R14: ffff88807b9ae000 R15: ffffc90003af1000
> FS:  0000000000000000(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fffaca68fb8 CR3: 00000000791c8000 CR4: 00000000003526f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  __list_del_entry_valid include/linux/list.h:124 [inline]
>  __list_del_entry include/linux/list.h:215 [inline]
>  list_move_tail include/linux/list.h:310 [inline]
>  swap_reclaim_full_clusters+0x109/0x460 mm/swapfile.c:748
>  swap_reclaim_work+0x2e/0x40 mm/swapfile.c:779

The syzbot console output indicates a virtual environment where swapfile
is on a rotational device.  In this case, clusters aren't actually used,
and si->full_clusters is not initialized.  Daan's report is from qemu, so
likely rotational too.

Make sure to only schedule the cluster reclaim work when clusters are
actually in use.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Link: https://github.com/systemd/systemd/issues/35044
Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters")
Reported-by: [email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: Daan De Meyer <[email protected]>
Cc: Kairui Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agovdpa/mlx5: Fix PA offset with unaligned starting iotlb map
Si-Wei Liu [Mon, 21 Oct 2024 13:40:39 +0000 (16:40 +0300)]
vdpa/mlx5: Fix PA offset with unaligned starting iotlb map

When calculating the physical address range based on the iotlb and mr
[start,end) ranges, the offset of mr->start relative to map->start
is not taken into account. This leads to some incorrect and duplicate
mappings.

For the case when mr->start < map->start the code is already correct:
the range in [mr->start, map->start) was handled by a different
iteration.

Fixes: 94abbccdf291 ("vdpa/mlx5: Add shared memory registration code")
Cc: [email protected]
Signed-off-by: Si-Wei Liu <[email protected]>
Signed-off-by: Dragos Tatulea <[email protected]>
Message-Id: <20241021134040[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Acked-by: Jason Wang <[email protected]>
4 months agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Tue, 12 Nov 2024 21:35:13 +0000 (13:35 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86 and selftests fixes.

  x86:

   - When emulating a guest TLB flush for a nested guest, flush vpid01,
     not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if
     L2 and L1 are sharing VPID '0' (from L1's perspective).

   - Fix a bug in the SNP initialization flow where KVM would return '0'
     to userspace instead of -errno on failure.

   - Move the Intel PT virtualization (i.e. outputting host trace to
     host buffer and guest trace to guest buffer) behind CONFIG_BROKEN.

   - Fix memory leak on failure of KVM_SEV_SNP_LAUNCH_START

   - Fix a bug where KVM fails to inject an interrupt from the IRR after
     KVM_SET_LAPIC.

  Selftests:

   - Increase the timeout for the memslot performance selftest to avoid
     false failures on arm64 and nested x86 platforms.

   - Fix a goof in the guest_memfd selftest where a for-loop initialized
     a bit mask to zero instead of BIT(0).

   - Disable strict aliasing when building KVM selftests to prevent the
     compiler from treating things like "u64 *" to "uint64_t *" cases as
     undefined behavior, which can lead to nasty, hard to debug
     failures.

   - Force -march=x86-64-v2 for KVM x86 selftests if and only if the
     uarch is supported by the compiler.

   - Fix broken compilation of kvm selftests after a header sync in
     tools/"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKEN
  KVM: x86: Unconditionally set irr_pending when updating APICv state
  kvm: svm: Fix gctx page leak on invalid inputs
  KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WB
  KVM: SVM: Propagate error from snp_guest_req_init() to userspace
  KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabled
  KVM: selftests: Don't force -march=x86-64-v2 if it's unsupported
  KVM: selftests: Disable strict aliasing
  KVM: selftests: fix unintentional noop test in guest_memfd_test.c
  KVM: selftests: memslot_perf_test: increase guest sync timeout

4 months agoMerge tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Tue, 12 Nov 2024 21:21:07 +0000 (13:21 -0800)]
Merge tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

 - fix warnings about duplicate slab cache names

* tag 'for-6.12/dm-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-cache: fix warnings about duplicate slab caches
  dm-bufio: fix warnings about duplicate slab caches

4 months agoMerge tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar...
Linus Torvalds [Tue, 12 Nov 2024 21:06:31 +0000 (13:06 -0800)]
Merge tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity

Pull integrity fixes from Mimi Zohar:
 "One bug fix, one performance improvement, and the use of
  static_assert:

   - The bug fix addresses "only a cosmetic change" commit, which didn't
     take into account the original 'ima' template definition.

  - The performance improvement limits the atomic_read()"

* tag 'integrity-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  integrity: Use static_assert() to check struct sizes
  evm: stop avoidably reading i_writecount in evm_file_release
  ima: fix buffer overrun in ima_eventdigest_init_common

4 months agoMerge tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic...
Linus Torvalds [Tue, 12 Nov 2024 21:01:09 +0000 (13:01 -0800)]
Merge tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux

Pull landlock fixes from Mickaël Salaün:
 "This fixes issues in the Landlock's sandboxer sample and
  documentation, slightly refactors helpers (required for ongoing patch
  series), and improve/fix a feature merged in v6.12 (signal and
  abstract UNIX socket scoping)"

* tag 'landlock-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mic/linux:
  landlock: Optimize scope enforcement
  landlock: Refactor network access mask management
  landlock: Refactor filesystem access mask management
  samples/landlock: Clarify option parsing behaviour
  samples/landlock: Refactor help message
  samples/landlock: Fix port parsing in sandboxer
  landlock: Fix grammar issues in documentation
  landlock: Improve documentation of previous limitations

4 months agoselftests: hugetlb_dio: fixup check for initial conditions to skip in the start
Donet Tom [Sun, 10 Nov 2024 06:49:03 +0000 (00:49 -0600)]
selftests: hugetlb_dio: fixup check for initial conditions to skip in the start

This test verifies that a hugepage, used as a user buffer for DIO
operations, is correctly freed upon unmapping.  To test this, we read the
count of free hugepages before and after the mmap, DIO, and munmap
operations, then check if the free hugepage count is the same.

Reading free hugepages before the test was removed by commit 0268d4579901
('selftests: hugetlb_dio: check for initial conditions to skip at the
start'), causing the test to always fail.

This patch adds back reading the free hugepages before starting the test.
With this patch, the tests are now passing.

Test results without this patch:

./tools/testing/selftests/mm/hugetlb_dio
TAP version 13
1..4
 # No. Free pages before allocation : 0
 # No. Free pages after munmap : 100
not ok 1 : Huge pages not freed!
 # No. Free pages before allocation : 0
 # No. Free pages after munmap : 100
not ok 2 : Huge pages not freed!
 # No. Free pages before allocation : 0
 # No. Free pages after munmap : 100
not ok 3 : Huge pages not freed!
 # No. Free pages before allocation : 0
 # No. Free pages after munmap : 100
not ok 4 : Huge pages not freed!
 # Totals: pass:0 fail:4 xfail:0 xpass:0 skip:0 error:0

Test results with this patch:

/tools/testing/selftests/mm/hugetlb_dio
TAP version 13
1..4
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 1 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 2 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 3 : Huge pages freed successfully !
# No. Free pages before allocation : 100
# No. Free pages after munmap : 100
ok 4 : Huge pages freed successfully !

# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0268d4579901 ("selftests: hugetlb_dio: check for initial conditions to skip in the start")
Signed-off-by: Donet Tom <[email protected]>
Cc: Muhammad Usama Anjum <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agomm/thp: fix deferred split queue not partially_mapped: fix
Hugh Dickins [Sun, 10 Nov 2024 21:11:21 +0000 (13:11 -0800)]
mm/thp: fix deferred split queue not partially_mapped: fix

Though even more elusive than before, list_del corruption has still been
seen on THP's deferred split queue.

The idea in commit e66f3185fa04 was right, but its implementation wrong.
The context omitted an important comment just before the critical test:
"split_folio() removes folio from list on success." In ignoring that
comment, when a THP split succeeded, the code went on to release the
preceding safe folio, preserving instead an irrelevant (formerly head)
folio: which gives no safety because it's not on the list.  Fix the logic.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e66f3185fa04 ("mm/thp: fix deferred split queue not partially_mapped")
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Usama Arif <[email protected]>
Reviewed-by: Zi Yan <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Chris Li <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Kefeng Wang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agomm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases
John Hubbard [Tue, 5 Nov 2024 03:29:44 +0000 (19:29 -0800)]
mm/gup: avoid an unnecessary allocation call for FOLL_LONGTERM cases

commit 53ba78de064b ("mm/gup: introduce
check_and_migrate_movable_folios()") created a new constraint on the
pin_user_pages*() API family: a potentially large internal allocation must
now occur, for FOLL_LONGTERM cases.

A user-visible consequence has now appeared: user space can no longer pin
more than 2GB of memory anymore on x86_64.  That's because, on a 4KB
PAGE_SIZE system, when user space tries to (indirectly, via a device
driver that calls pin_user_pages()) pin 2GB, this requires an allocation
of a folio pointers array of MAX_PAGE_ORDER size, which is the limit for
kmalloc().

In addition to the directly visible effect described above, there is also
the problem of adding an unnecessary allocation.  The **pages array
argument has already been allocated, and there is no need for a redundant
**folios array allocation in this case.

Fix this by avoiding the new allocation entirely.  This is done by
referring to either the original page[i] within **pages, or to the
associated folio.  Thanks to David Hildenbrand for suggesting this
approach and for providing the initial implementation (which I've tested
and adjusted slightly) as well.

[[email protected]: whitespace tweak, per David]
Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: bypass pofs_get_folio(), per Oscar]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 53ba78de064b ("mm/gup: introduce check_and_migrate_movable_folios()")
Signed-off-by: John Hubbard <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Cc: Vivek Kasireddy <[email protected]>
Cc: Dave Airlie <[email protected]>
Cc: Gerd Hoffmann <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Dongwon Kim <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Junxiao Chang <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agoBluetooth: btintel: Direct exception event to bluetooth stack
Kiran K [Tue, 22 Oct 2024 09:11:34 +0000 (14:41 +0530)]
Bluetooth: btintel: Direct exception event to bluetooth stack

Have exception event part of HCI traces which helps for debug.

snoop traces:
> HCI Event: Vendor (0xff) plen 79
        Vendor Prefix (0x8780)
      Intel Extended Telemetry (0x03)
        Unknown extended telemetry event type (0xde)
        01 01 de
        Unknown extended subevent 0x07
        01 01 de 07 01 de 06 1c ef be ad de ef be ad de
        ef be ad de ef be ad de ef be ad de ef be ad de
        ef be ad de 05 14 ef be ad de ef be ad de ef be
        ad de ef be ad de ef be ad de 43 10 ef be ad de
        ef be ad de ef be ad de ef be ad de

Fixes: af395330abed ("Bluetooth: btintel: Add Intel devcoredump support")
Signed-off-by: Kiran K <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
4 months agoBluetooth: hci_core: Fix calling mgmt_device_connected
Luiz Augusto von Dentz [Fri, 8 Nov 2024 16:19:54 +0000 (11:19 -0500)]
Bluetooth: hci_core: Fix calling mgmt_device_connected

Since 61a939c68ee0 ("Bluetooth: Queue incoming ACL data until
BT_CONNECTED state is reached") there is no long the need to call
mgmt_device_connected as ACL data will be queued until BT_CONNECTED
state.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219458
Link: https://github.com/bluez/bluez/issues/1014
Fixes: 333b4fd11e89 ("Bluetooth: L2CAP: Fix uaf in l2cap_connect")
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
4 months agoMerge branch 'virtio-vsock-fix-memory-leaks'
Paolo Abeni [Tue, 12 Nov 2024 11:16:54 +0000 (12:16 +0100)]
Merge branch 'virtio-vsock-fix-memory-leaks'

Michal Luczaj says:

====================
virtio/vsock: Fix memory leaks

Short series fixing some memory leaks that I've stumbled upon while toying
with the selftests.

Signed-off-by: Michal Luczaj <[email protected]>
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
4 months agovirtio/vsock: Improve MSG_ZEROCOPY error handling
Michal Luczaj [Thu, 7 Nov 2024 20:46:14 +0000 (21:46 +0100)]
virtio/vsock: Improve MSG_ZEROCOPY error handling

Add a missing kfree_skb() to prevent memory leaks.

Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Michal Luczaj <[email protected]>
Acked-by: Arseniy Krasnov <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
4 months agovsock: Fix sk_error_queue memory leak
Michal Luczaj [Thu, 7 Nov 2024 20:46:13 +0000 (21:46 +0100)]
vsock: Fix sk_error_queue memory leak

Kernel queues MSG_ZEROCOPY completion notifications on the error queue.
Where they remain, until explicitly recv()ed. To prevent memory leaks,
clean up the queue when the socket is destroyed.

unreferenced object 0xffff8881028beb00 (size 224):
  comm "vsock_test", pid 1218, jiffies 4294694897
  hex dump (first 32 bytes):
    90 b0 21 17 81 88 ff ff 90 b0 21 17 81 88 ff ff  ..!.......!.....
    00 00 00 00 00 00 00 00 00 b0 21 17 81 88 ff ff  ..........!.....
  backtrace (crc 6c7031ca):
    [<ffffffff81418ef7>] kmem_cache_alloc_node_noprof+0x2f7/0x370
    [<ffffffff81d35882>] __alloc_skb+0x132/0x180
    [<ffffffff81d2d32b>] sock_omalloc+0x4b/0x80
    [<ffffffff81d3a8ae>] msg_zerocopy_realloc+0x9e/0x240
    [<ffffffff81fe5cb2>] virtio_transport_send_pkt_info+0x412/0x4c0
    [<ffffffff81fe6183>] virtio_transport_stream_enqueue+0x43/0x50
    [<ffffffff81fe0813>] vsock_connectible_sendmsg+0x373/0x450
    [<ffffffff81d233d5>] ____sys_sendmsg+0x365/0x3a0
    [<ffffffff81d246f4>] ___sys_sendmsg+0x84/0xd0
    [<ffffffff81d26f47>] __sys_sendmsg+0x47/0x80
    [<ffffffff820d3df3>] do_syscall_64+0x93/0x180
    [<ffffffff8220012b>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
Signed-off-by: Michal Luczaj <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
Acked-by: Arseniy Krasnov <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
4 months agovirtio/vsock: Fix accept_queue memory leak
Michal Luczaj [Thu, 7 Nov 2024 20:46:12 +0000 (21:46 +0100)]
virtio/vsock: Fix accept_queue memory leak

As the final stages of socket destruction may be delayed, it is possible
that virtio_transport_recv_listen() will be called after the accept_queue
has been flushed, but before the SOCK_DONE flag has been set. As a result,
sockets enqueued after the flush would remain unremoved, leading to a
memory leak.

vsock_release
  __vsock_release
    lock
    virtio_transport_release
      virtio_transport_close
        schedule_delayed_work(close_work)
    sk_shutdown = SHUTDOWN_MASK
(!) flush accept_queue
    release
                                        virtio_transport_recv_pkt
                                          vsock_find_bound_socket
                                          lock
                                          if flag(SOCK_DONE) return
                                          virtio_transport_recv_listen
                                            child = vsock_create_connected
                                      (!)   vsock_enqueue_accept(child)
                                          release
close_work
  lock
  virtio_transport_do_close
    set_flag(SOCK_DONE)
    virtio_transport_remove_sock
      vsock_remove_sock
        vsock_remove_bound
  release

Introduce a sk_shutdown check to disallow vsock_enqueue_accept() during
socket destruction.

unreferenced object 0xffff888109e3f800 (size 2040):
  comm "kworker/5:2", pid 371, jiffies 4294940105
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    28 00 0b 40 00 00 00 00 00 00 00 00 00 00 00 00  (..@............
  backtrace (crc 9e5f4e84):
    [<ffffffff81418ff1>] kmem_cache_alloc_noprof+0x2c1/0x360
    [<ffffffff81d27aa0>] sk_prot_alloc+0x30/0x120
    [<ffffffff81d2b54c>] sk_alloc+0x2c/0x4b0
    [<ffffffff81fe049a>] __vsock_create.constprop.0+0x2a/0x310
    [<ffffffff81fe6d6c>] virtio_transport_recv_pkt+0x4dc/0x9a0
    [<ffffffff81fe745d>] vsock_loopback_work+0xfd/0x140
    [<ffffffff810fc6ac>] process_one_work+0x20c/0x570
    [<ffffffff810fce3f>] worker_thread+0x1bf/0x3a0
    [<ffffffff811070dd>] kthread+0xdd/0x110
    [<ffffffff81044fdd>] ret_from_fork+0x2d/0x50
    [<ffffffff8100785a>] ret_from_fork_asm+0x1a/0x30

Fixes: 3fe356d58efa ("vsock/virtio: discard packets only when socket is really closed")
Reviewed-by: Stefano Garzarella <[email protected]>
Signed-off-by: Michal Luczaj <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
4 months agobcachefs: Fix assertion pop in bch2_ptr_swab()
Kent Overstreet [Mon, 11 Nov 2024 21:15:15 +0000 (16:15 -0500)]
bcachefs: Fix assertion pop in bch2_ptr_swab()

This runs on extents that haven't yet been validated, so we don't want
to assert that we have a valid entry type.

Reported-by: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
4 months agobcachefs: Fix journal_entry_dev_usage_to_text() overrun
Kent Overstreet [Mon, 11 Nov 2024 21:01:38 +0000 (16:01 -0500)]
bcachefs: Fix journal_entry_dev_usage_to_text() overrun

If the jset_entry_dev_usage is malformed, and too small, our nr_entries
calculation will be incorrect - just bail out.

Reported-by: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
4 months agoLoongArch: Fix AP booting issue in VM mode
Bibo Mao [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Fix AP booting issue in VM mode

Native IPI is used for AP booting, because it is the booting interface
between OS and BIOS firmware. The paravirt IPI is only used inside OS,
and native IPI is necessary to boot AP.

When booting AP, we write the kernel entry address in the HW mailbox of
AP and send IPI interrupt to it. AP executes idle instruction and waits
for interrupts or SW events, then clears IPI interrupt and jumps to the
kernel entry from HW mailbox.

Between writing HW mailbox and sending IPI, AP can be woken up by SW
events and jumps to the kernel entry, so ACTION_BOOT_CPU IPI interrupt
will keep pending during AP booting. And native IPI interrupt handler
needs be registered so that it can clear pending native IPI, else there
will be endless interrupts during AP booting stage.

Here native IPI interrupt is initialized even if paravirt IPI is used.

Cc: [email protected]
Fixes: 74c16b2e2b0c ("LoongArch: KVM: Add PV IPI support on guest side")
Signed-off-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
4 months agoLoongArch: Add WriteCombine shadow mapping in KASAN
Kanglong Wang [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Add WriteCombine shadow mapping in KASAN

Currently, the kernel couldn't boot when ARCH_IOREMAP, ARCH_WRITECOMBINE
and KASAN are enabled together. Because DMW2 is used by kernel now which
is configured as 0xa000000000000000 for WriteCombine, but KASAN has no
segment mapping for it. This patch fix this issue.

Solution: Add the relevant definitions for WriteCombine (DMW2) in KASAN.

Cc: [email protected]
Fixes: 8e02c3b782ec ("LoongArch: Add writecombine support for DMW-based ioremap()")
Signed-off-by: Kanglong Wang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
4 months agoLoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits
Huacai Chen [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Disable KASAN if PGDIR_SIZE is too large for cpu_vabits

If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will
overflow UINTPTR_MAX because KASAN_SHADOW_START/KASAN_SHADOW_END are
aligned up by PGDIR_SIZE. And then the overflowed KASAN_SHADOW_END looks
like a user space address.

For example, PGDIR_SIZE of CONFIG_4KB_4LEVEL is 2^39, which is too large
for Loongson-2K series whose cpu_vabits = 39.

Since CONFIG_4KB_4LEVEL is completely legal for CPUs with cpu_vabits <=
39, we just disable KASAN via early return in kasan_init(). Otherwise we
get a boot failure.

Moreover, we change KASAN_SHADOW_END from the first address after KASAN
shadow area to the last address in KASAN shadow area, in order to avoid
the end address exactly overflow to 0 (which is a legal case). We don't
need to worry about alignment because pgd_addr_end() can handle it.

Cc: [email protected]
Reviewed-by: Jiaxun Yang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
4 months agoLoongArch: Make KASAN work with 5-level page-tables
Huacai Chen [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Make KASAN work with 5-level page-tables

Make KASAN work with 5-level page-tables, including:
1. Implement and use __pgd_none() and kasan_p4d_offset().
2. As done in kasan_pmd_populate() and kasan_pte_populate(), restrict
   the loop conditions of kasan_p4d_populate() and kasan_pud_populate()
   to avoid unnecessary population.

Cc: [email protected]
Signed-off-by: Huacai Chen <[email protected]>
4 months agoLoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS
Yuli Wang [Tue, 12 Nov 2024 08:35:39 +0000 (16:35 +0800)]
LoongArch: Define a default value for VM_DATA_DEFAULT_FLAGS

This is a trivial cleanup, commit c62da0c35d58518d ("mm/vma: define a
default value for VM_DATA_DEFAULT_FLAGS") has unified default values of
VM_DATA_DEFAULT_FLAGS across different platforms.

Apply the same consistency to LoongArch.

Suggested-by: Wentao Guan <[email protected]>
Signed-off-by: Yuli Wang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
4 months agoLoongArch: Fix early_numa_add_cpu() usage for FDT systems
Huacai Chen [Tue, 12 Nov 2024 08:35:36 +0000 (16:35 +0800)]
LoongArch: Fix early_numa_add_cpu() usage for FDT systems

early_numa_add_cpu() applies on physical CPU id rather than logical CPU
id, so use cpuid instead of cpu.

Cc: [email protected]
Fixes: 3de9c42d02a79a5 ("LoongArch: Add all CPUs enabled by fdt to NUMA node 0")
Reported-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
4 months agoLoongArch: For all possible CPUs setup logical-physical CPU mapping
Huacai Chen [Tue, 12 Nov 2024 08:35:36 +0000 (16:35 +0800)]
LoongArch: For all possible CPUs setup logical-physical CPU mapping

In order to support ACPI-based physical CPU hotplug, we suppose for all
"possible" CPUs cpu_logical_map() can work. Because some drivers want to
use cpu_logical_map() for all "possible" CPUs, while currently we only
setup logical-physical mapping for "present" CPUs. This lack of mapping
also causes cpu_to_node() cannot work for hot-added CPUs.

All "possible" CPUs are listed in MADT, and the "present" subset is
marked as ACPI_MADT_ENABLED. To setup logical-physical CPU mapping for
all possible CPUs and keep present CPUs continuous in cpu_present_mask,
we parse MADT twice. The first pass handles CPUs with ACPI_MADT_ENABLED
and the second pass handles CPUs without ACPI_MADT_ENABLED.

The global flag (cpu_enumerated) is removed because acpi_map_cpu() calls
cpu_number_map() rather than set_processor_mask() now.

Reported-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
4 months agoMerge branch 'mlx5-misc-fixes-2024-11-07'
Jakub Kicinski [Tue, 12 Nov 2024 03:23:40 +0000 (19:23 -0800)]
Merge branch 'mlx5-misc-fixes-2024-11-07'

Tariq Toukan says:

====================
mlx5 misc fixes 2024-11-07

This patchset provides misc bug fixes from the team to the mlx5 core and
Eth drivers.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet/mlx5e: Disable loopback self-test on multi-PF netdev
Carolina Jubran [Thu, 7 Nov 2024 18:35:27 +0000 (20:35 +0200)]
net/mlx5e: Disable loopback self-test on multi-PF netdev

In Multi-PF (Socket Direct) configurations, when a loopback packet is
sent through one of the secondary devices, it will always be received
on the primary device. This causes the loopback layer to fail in
identifying the loopback packet as the devices are different.

To avoid false test failures, disable the loopback self-test in
Multi-PF configurations.

Fixes: ed29705e4ed1 ("net/mlx5: Enable SD feature")
Signed-off-by: Carolina Jubran <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet/mlx5e: CT: Fix null-ptr-deref in add rule err flow
Moshe Shemesh [Thu, 7 Nov 2024 18:35:26 +0000 (20:35 +0200)]
net/mlx5e: CT: Fix null-ptr-deref in add rule err flow

In error flow of mlx5_tc_ct_entry_add_rule(), in case ct_rule_add()
callback returns error, zone_rule->attr is used uninitiated. Fix it to
use attr which has the needed pointer value.

Kernel log:
 BUG: kernel NULL pointer dereference, address: 0000000000000110
 RIP: 0010:mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core]

 Call Trace:
  <TASK>
  ? __die+0x20/0x70
  ? page_fault_oops+0x150/0x3e0
  ? exc_page_fault+0x74/0x140
  ? asm_exc_page_fault+0x22/0x30
  ? mlx5_tc_ct_entry_add_rule+0x2b1/0x2f0 [mlx5_core]
  ? mlx5_tc_ct_entry_add_rule+0x1d5/0x2f0 [mlx5_core]
  mlx5_tc_ct_block_flow_offload+0xc6a/0xf90 [mlx5_core]
  ? nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table]
  nf_flow_offload_tuple+0xd8/0x190 [nf_flow_table]
  flow_offload_work_handler+0x142/0x320 [nf_flow_table]
  ? finish_task_switch.isra.0+0x15b/0x2b0
  process_one_work+0x16c/0x320
  worker_thread+0x28c/0x3a0
  ? __pfx_worker_thread+0x10/0x10
  kthread+0xb8/0xf0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x2d/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>

Fixes: 7fac5c2eced3 ("net/mlx5: CT: Avoid reusing modify header context for natted entries")
Signed-off-by: Moshe Shemesh <[email protected]>
Reviewed-by: Cosmin Ratiu <[email protected]>
Reviewed-by: Yevgeny Kliteynik <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet/mlx5e: clear xdp features on non-uplink representors
William Tu [Thu, 7 Nov 2024 18:35:25 +0000 (20:35 +0200)]
net/mlx5e: clear xdp features on non-uplink representors

Non-uplink representor port does not support XDP. The patch clears
the xdp feature by checking the net_device_ops.ndo_bpf is set or not.

Verify using the netlink tool:
$ tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml --dump dev-get

Representor netdev before the patch:
{'ifindex': 8,
  'xdp-features': {'basic',
                   'ndo-xmit',
                   'ndo-xmit-sg',
                   'redirect',
                   'rx-sg',
                   'xsk-zerocopy'},
  'xdp-rx-metadata-features': set(),
  'xdp-zc-max-segs': 1,
  'xsk-features': set()},
With the patch:
 {'ifindex': 8,
  'xdp-features': set(),
  'xdp-rx-metadata-features': set(),
  'xsk-features': set()},

Fixes: 4d5ab0ad964d ("net/mlx5e: take into account device reconfiguration for xdp_features flag")
Signed-off-by: William Tu <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet/mlx5e: kTLS, Fix incorrect page refcounting
Dragos Tatulea [Thu, 7 Nov 2024 18:35:24 +0000 (20:35 +0200)]
net/mlx5e: kTLS, Fix incorrect page refcounting

The kTLS tx handling code is using a mix of get_page() and
page_ref_inc() APIs to increment the page reference. But on the release
path (mlx5e_ktls_tx_handle_resync_dump_comp()), only put_page() is used.

This is an issue when using pages from large folios: the get_page()
references are stored on the folio page while the page_ref_inc()
references are stored directly in the given page. On release the folio
page will be dereferenced too many times.

This was found while doing kTLS testing with sendfile() + ZC when the
served file was read from NFS on a kernel with NFS large folios support
(commit 49b29a573da8 ("nfs: add support for large folios")).

Fixes: 84d1bb2b139e ("net/mlx5e: kTLS, Limit DUMP wqe size")
Signed-off-by: Dragos Tatulea <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet/mlx5: fs, lock FTE when checking if active
Mark Bloch [Thu, 7 Nov 2024 18:35:23 +0000 (20:35 +0200)]
net/mlx5: fs, lock FTE when checking if active

The referenced commits introduced a two-step process for deleting FTEs:

- Lock the FTE, delete it from hardware, set the hardware deletion function
  to NULL and unlock the FTE.
- Lock the parent flow group, delete the software copy of the FTE, and
  remove it from the xarray.

However, this approach encounters a race condition if a rule with the same
match value is added simultaneously. In this scenario, fs_core may set the
hardware deletion function to NULL prematurely, causing a panic during
subsequent rule deletions.

To prevent this, ensure the active flag of the FTE is checked under a lock,
which will prevent the fs_core layer from attaching a new steering rule to
an FTE that is in the process of deletion.

[  438.967589] MOSHE: 2496 mlx5_del_flow_rules del_hw_func
[  438.968205] ------------[ cut here ]------------
[  438.968654] refcount_t: decrement hit 0; leaking memory.
[  438.969249] WARNING: CPU: 0 PID: 8957 at lib/refcount.c:31 refcount_warn_saturate+0xfb/0x110
[  438.970054] Modules linked in: act_mirred cls_flower act_gact sch_ingress openvswitch nsh mlx5_vdpa vringh vhost_iotlb vdpa mlx5_ib mlx5_core xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm ib_uverbs ib_core zram zsmalloc fuse [last unloaded: cls_flower]
[  438.973288] CPU: 0 UID: 0 PID: 8957 Comm: tc Not tainted 6.12.0-rc1+ #8
[  438.973888] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  438.974874] RIP: 0010:refcount_warn_saturate+0xfb/0x110
[  438.975363] Code: 40 66 3b 82 c6 05 16 e9 4d 01 01 e8 1f 7c a0 ff 0f 0b c3 cc cc cc cc 48 c7 c7 10 66 3b 82 c6 05 fd e8 4d 01 01 e8 05 7c a0 ff <0f> 0b c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 90
[  438.976947] RSP: 0018:ffff888124a53610 EFLAGS: 00010286
[  438.977446] RAX: 0000000000000000 RBX: ffff888119d56de0 RCX: 0000000000000000
[  438.978090] RDX: ffff88852c828700 RSI: ffff88852c81b3c0 RDI: ffff88852c81b3c0
[  438.978721] RBP: ffff888120fa0e88 R08: 0000000000000000 R09: ffff888124a534b0
[  438.979353] R10: 0000000000000001 R11: 0000000000000001 R12: ffff888119d56de0
[  438.979979] R13: ffff888120fa0ec0 R14: ffff888120fa0ee8 R15: ffff888119d56de0
[  438.980607] FS:  00007fe6dcc0f800(0000) GS:ffff88852c800000(0000) knlGS:0000000000000000
[  438.983984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  438.984544] CR2: 00000000004275e0 CR3: 0000000186982001 CR4: 0000000000372eb0
[  438.985205] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  438.985842] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  438.986507] Call Trace:
[  438.986799]  <TASK>
[  438.987070]  ? __warn+0x7d/0x110
[  438.987426]  ? refcount_warn_saturate+0xfb/0x110
[  438.987877]  ? report_bug+0x17d/0x190
[  438.988261]  ? prb_read_valid+0x17/0x20
[  438.988659]  ? handle_bug+0x53/0x90
[  438.989054]  ? exc_invalid_op+0x14/0x70
[  438.989458]  ? asm_exc_invalid_op+0x16/0x20
[  438.989883]  ? refcount_warn_saturate+0xfb/0x110
[  438.990348]  mlx5_del_flow_rules+0x2f7/0x340 [mlx5_core]
[  438.990932]  __mlx5_eswitch_del_rule+0x49/0x170 [mlx5_core]
[  438.991519]  ? mlx5_lag_is_sriov+0x3c/0x50 [mlx5_core]
[  438.992054]  ? xas_load+0x9/0xb0
[  438.992407]  mlx5e_tc_rule_unoffload+0x45/0xe0 [mlx5_core]
[  438.993037]  mlx5e_tc_del_fdb_flow+0x2a6/0x2e0 [mlx5_core]
[  438.993623]  mlx5e_flow_put+0x29/0x60 [mlx5_core]
[  438.994161]  mlx5e_delete_flower+0x261/0x390 [mlx5_core]
[  438.994728]  tc_setup_cb_destroy+0xb9/0x190
[  438.995150]  fl_hw_destroy_filter+0x94/0xc0 [cls_flower]
[  438.995650]  fl_change+0x11a4/0x13c0 [cls_flower]
[  438.996105]  tc_new_tfilter+0x347/0xbc0
[  438.996503]  ? ___slab_alloc+0x70/0x8c0
[  438.996929]  rtnetlink_rcv_msg+0xf9/0x3e0
[  438.997339]  ? __netlink_sendskb+0x4c/0x70
[  438.997751]  ? netlink_unicast+0x286/0x2d0
[  438.998171]  ? __pfx_rtnetlink_rcv_msg+0x10/0x10
[  438.998625]  netlink_rcv_skb+0x54/0x100
[  438.999020]  netlink_unicast+0x203/0x2d0
[  438.999421]  netlink_sendmsg+0x1e4/0x420
[  438.999820]  __sock_sendmsg+0xa1/0xb0
[  439.000203]  ____sys_sendmsg+0x207/0x2a0
[  439.000600]  ? copy_msghdr_from_user+0x6d/0xa0
[  439.001072]  ___sys_sendmsg+0x80/0xc0
[  439.001459]  ? ___sys_recvmsg+0x8b/0xc0
[  439.001848]  ? generic_update_time+0x4d/0x60
[  439.002282]  __sys_sendmsg+0x51/0x90
[  439.002658]  do_syscall_64+0x50/0x110
[  439.003040]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: 718ce4d601db ("net/mlx5: Consolidate update FTE for all removal changes")
Fixes: cefc23554fc2 ("net/mlx5: Fix FTE cleanup")
Signed-off-by: Mark Bloch <[email protected]>
Reviewed-by: Maor Gottlieb <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet/mlx5: Fix msix vectors to respect platform limit
Parav Pandit [Thu, 7 Nov 2024 18:35:22 +0000 (20:35 +0200)]
net/mlx5: Fix msix vectors to respect platform limit

The number of PCI vectors allocated by the platform (which may be fewer
than requested) is currently not honored when creating the SF pool;
only the PCI MSI-X capability is considered.

As a result, when a platform allocates fewer vectors
(in non-dynamic mode) than requested, the PF and SF pools end up
with an invalid vector range.

This causes incorrect SF vector accounting, which leads to the
following call trace when an invalid IRQ vector is allocated.

This issue is resolved by ensuring that the platform's vector
limit is respected for both the SF and PF pools.

Workqueue: mlx5_vhca_event0 mlx5_sf_dev_add_active_work [mlx5_core]
RIP: 0010:pci_irq_vector+0x23/0x80
RSP: 0018:ffffabd5cebd7248 EFLAGS: 00010246
RAX: ffff980880e7f308 RBX: ffff9808932fb880 RCX: 0000000000000001
RDX: 00000000000001ff RSI: 0000000000000200 RDI: ffff980880e7f308
RBP: 0000000000000200 R08: 0000000000000010 R09: ffff97a9116f0860
R10: 0000000000000002 R11: 0000000000000228 R12: ffff980897cd0160
R13: 0000000000000000 R14: ffff97a920fec0c0 R15: ffffabd5cebd72d0
FS:  0000000000000000(0000) GS:ffff97c7ff9c0000(0000) knlGS:0000000000000000
 ? rescuer_thread+0x350/0x350
 kthread+0x11b/0x140
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
mlx5_core.sf mlx5_core.sf.6: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0 enhanced)
mlx5_core.sf mlx5_core.sf.7: firmware version: 32.43.356
mlx5_core.sf mlx5_core.sf.6 enpa1s0f0s4: renamed from eth0
mlx5_core.sf mlx5_core.sf.7: Rate limit: 127 rates are supported, range: 0Mbps to 195312Mbps
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22
mlx5_core 0000:a1:00.0: mlx5_irq_alloc:321:(pid 6781): Failed to request irq. err = -22

Fixes: 3354822cde5a ("net/mlx5: Use dynamic msix vectors allocation")
Signed-off-by: Parav Pandit <[email protected]>
Signed-off-by: Amir Tzin <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet/mlx5: E-switch, unload IB representors when unloading ETH representors
Chiara Meiohas [Thu, 7 Nov 2024 18:35:21 +0000 (20:35 +0200)]
net/mlx5: E-switch, unload IB representors when unloading ETH representors

IB representors depend on ETH representors, so the IB representors
should not exist without the ETH ones. When unloading the ETH
representors, the corresponding IB representors should be also
unloaded.

The commit 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions")
introduced the use of the ib_device_set_netdev API in IB
repsresentors. ib_device_set_netdev() increments the refcount of
the representor's netdev when loading an IB representor and
decrements it when unloading.
Without the unloading of the IB representor, the refcount of the
representor's netdev remains greater than 0, preventing it from
being unregistered.
The patch uncovered an underlying bug where the eth representor is
unloaded, without unloading the IB representor.

This issue happened when using multiport E-switch and rebooting,
causing the shutdown to hang when unloading the ETH representor
because the refcount of the representor's netdevice was greater than 0.

Call trace:
unregister_netdevice: waiting for eth3 to become free. Usage count = 2
ref_tracker: eth%d@00000000661d60f7 has 1/1 users at
ib_device_set_netdev+0x160/0x2d0 [ib_core]
mlx5_ib_vport_rep_load+0x104/0x3f0 [mlx5_ib]
mlx5_eswitch_reload_ib_reps+0xfc/0x110 [mlx5_core]
mlx5_mpesw_work+0x236/0x330 [mlx5_core]
process_one_work+0x169/0x320
worker_thread+0x288/0x3a0
kthread+0xb8/0xe0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x11/0x20

Fixes: 8d159eb2117b ("RDMA/mlx5: Use IB set_netdev and get_netdev functions")
Signed-off-by: Chiara Meiohas <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMerge branch 'mptcp-fix-a-couple-of-races'
Jakub Kicinski [Tue, 12 Nov 2024 03:06:36 +0000 (19:06 -0800)]
Merge branch 'mptcp-fix-a-couple-of-races'

Paolo Abeni says:

====================
mptcp: fix a couple of races

The first patch addresses a division by zero issue reported by Eric,
the second one solves a similar issue found by code inspection while
investigating the former.
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agomptcp: cope racing subflow creation in mptcp_rcv_space_adjust
Paolo Abeni [Fri, 8 Nov 2024 10:58:17 +0000 (11:58 +0100)]
mptcp: cope racing subflow creation in mptcp_rcv_space_adjust

Additional active subflows - i.e. created by the in kernel path
manager - are included into the subflow list before starting the
3whs.

A racing recvmsg() spooling data received on an already established
subflow would unconditionally call tcp_cleanup_rbuf() on all the
current subflows, potentially hitting a divide by zero error on
the newly created ones.

Explicitly check that the subflow is in a suitable state before
invoking tcp_cleanup_rbuf().

Fixes: c76c6956566f ("mptcp: call tcp_cleanup_rbuf on subflows")
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/02374660836e1b52afc91966b7535c8c5f7bafb0.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agomptcp: error out earlier on disconnect
Paolo Abeni [Fri, 8 Nov 2024 10:58:16 +0000 (11:58 +0100)]
mptcp: error out earlier on disconnect

Eric reported a division by zero splat in the MPTCP protocol:

Oops: divide error: 0000 [#1] PREEMPT SMP KASAN PTI
CPU: 1 UID: 0 PID: 6094 Comm: syz-executor317 Not tainted
6.12.0-rc5-syzkaller-00291-g05b92660cdfe #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 09/13/2024
RIP: 0010:__tcp_select_window+0x5b4/0x1310 net/ipv4/tcp_output.c:3163
Code: f6 44 01 e3 89 df e8 9b 75 09 f8 44 39 f3 0f 8d 11 ff ff ff e8
0d 74 09 f8 45 89 f4 e9 04 ff ff ff e8 00 74 09 f8 44 89 f0 99 <f7> 7c
24 14 41 29 d6 45 89 f4 e9 ec fe ff ff e8 e8 73 09 f8 48 89
RSP: 0018:ffffc900041f7930 EFLAGS: 00010293
RAX: 0000000000017e67 RBX: 0000000000017e67 RCX: ffffffff8983314b
RDX: 0000000000000000 RSI: ffffffff898331b0 RDI: 0000000000000004
RBP: 00000000005d6000 R08: 0000000000000004 R09: 0000000000017e67
R10: 0000000000003e80 R11: 0000000000000000 R12: 0000000000003e80
R13: ffff888031d9b440 R14: 0000000000017e67 R15: 00000000002eb000
FS: 00007feb5d7f16c0(0000) GS:ffff8880b8700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007feb5d8adbb8 CR3: 0000000074e4c000 CR4: 00000000003526f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__tcp_cleanup_rbuf+0x3e7/0x4b0 net/ipv4/tcp.c:1493
mptcp_rcv_space_adjust net/mptcp/protocol.c:2085 [inline]
mptcp_recvmsg+0x2156/0x2600 net/mptcp/protocol.c:2289
inet_recvmsg+0x469/0x6a0 net/ipv4/af_inet.c:885
sock_recvmsg_nosec net/socket.c:1051 [inline]
sock_recvmsg+0x1b2/0x250 net/socket.c:1073
__sys_recvfrom+0x1a5/0x2e0 net/socket.c:2265
__do_sys_recvfrom net/socket.c:2283 [inline]
__se_sys_recvfrom net/socket.c:2279 [inline]
__x64_sys_recvfrom+0xe0/0x1c0 net/socket.c:2279
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7feb5d857559
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 51 18 00 00 90 48 89 f8 48
89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007feb5d7f1208 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 00007feb5d8e1318 RCX: 00007feb5d857559
RDX: 000000800000000e RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00007feb5d8e1310 R08: 0000000000000000 R09: ffffffff81000000
R10: 0000000000000100 R11: 0000000000000246 R12: 00007feb5d8e131c
R13: 00007feb5d8ae074 R14: 000000800000000e R15: 00000000fffffdef

and provided a nice reproducer.

The root cause is the current bad handling of racing disconnect.
After the blamed commit below, sk_wait_data() can return (with
error) with the underlying socket disconnected and a zero rcv_mss.

Catch the error and return without performing any additional
operations on the current socket.

Reported-by: Eric Dumazet <[email protected]>
Fixes: 419ce133ab92 ("tcp: allow again tcp_disconnect() when threads are waiting")
Signed-off-by: Paolo Abeni <[email protected]>
Reviewed-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/8c82ecf71662ecbc47bf390f9905de70884c9f2d.1731060874.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet: clarify SO_DEVMEM_DONTNEED behavior in documentation
Mina Almasry [Thu, 7 Nov 2024 21:03:31 +0000 (21:03 +0000)]
net: clarify SO_DEVMEM_DONTNEED behavior in documentation

Document new behavior when the number of frags passed is too big.

Signed-off-by: Mina Almasry <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonet: fix SO_DEVMEM_DONTNEED looping too long
Mina Almasry [Thu, 7 Nov 2024 21:03:30 +0000 (21:03 +0000)]
net: fix SO_DEVMEM_DONTNEED looping too long

Exit early if we're freeing more than 1024 frags, to prevent
looping too long.

Also minor code cleanups:
- Flip checks to reduce indentation.
- Use sizeof(*tokens) everywhere for consistentcy.

Cc: Yi Lai <[email protected]>
Signed-off-by: Mina Almasry <[email protected]>
Acked-by: Stanislav Fomichev <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agonommu: pass NULL argument to vma_iter_prealloc()
Hajime Tazaki [Fri, 8 Nov 2024 22:28:34 +0000 (07:28 +0900)]
nommu: pass NULL argument to vma_iter_prealloc()

When deleting a vma entry from a maple tree, it has to pass NULL to
vma_iter_prealloc() in order to calculate internal state of the tree, but
it passed a wrong argument.  As a result, nommu kernels crashed upon
accessing a vma iterator, such as acct_collect() reading the size of vma
entries after do_munmap().

This commit fixes this issue by passing a right argument to the
preallocation call.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b5df09226450 ("mm: set up vma iterator for vma_iter_prealloc() calls")
Signed-off-by: Hajime Tazaki <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agoocfs2: fix UBSAN warning in ocfs2_verify_volume()
Dmitry Antipov [Wed, 6 Nov 2024 09:21:00 +0000 (12:21 +0300)]
ocfs2: fix UBSAN warning in ocfs2_verify_volume()

Syzbot has reported the following splat triggered by UBSAN:

UBSAN: shift-out-of-bounds in fs/ocfs2/super.c:2336:10
shift exponent 32768 is too large for 32-bit type 'int'
CPU: 2 UID: 0 PID: 5255 Comm: repro Not tainted 6.12.0-rc4-syzkaller-00047-gc2ee9f594da8 #0
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x241/0x360
 ? __pfx_dump_stack_lvl+0x10/0x10
 ? __pfx__printk+0x10/0x10
 ? __asan_memset+0x23/0x50
 ? lockdep_init_map_type+0xa1/0x910
 __ubsan_handle_shift_out_of_bounds+0x3c8/0x420
 ocfs2_fill_super+0xf9c/0x5750
 ? __pfx_ocfs2_fill_super+0x10/0x10
 ? __pfx_validate_chain+0x10/0x10
 ? __pfx_validate_chain+0x10/0x10
 ? validate_chain+0x11e/0x5920
 ? __lock_acquire+0x1384/0x2050
 ? __pfx_validate_chain+0x10/0x10
 ? string+0x26a/0x2b0
 ? widen_string+0x3a/0x310
 ? string+0x26a/0x2b0
 ? bdev_name+0x2b1/0x3c0
 ? pointer+0x703/0x1210
 ? __pfx_pointer+0x10/0x10
 ? __pfx_format_decode+0x10/0x10
 ? __lock_acquire+0x1384/0x2050
 ? vsnprintf+0x1ccd/0x1da0
 ? snprintf+0xda/0x120
 ? __pfx_lock_release+0x10/0x10
 ? do_raw_spin_lock+0x14f/0x370
 ? __pfx_snprintf+0x10/0x10
 ? set_blocksize+0x1f9/0x360
 ? sb_set_blocksize+0x98/0xf0
 ? setup_bdev_super+0x4e6/0x5d0
 mount_bdev+0x20c/0x2d0
 ? __pfx_ocfs2_fill_super+0x10/0x10
 ? __pfx_mount_bdev+0x10/0x10
 ? vfs_parse_fs_string+0x190/0x230
 ? __pfx_vfs_parse_fs_string+0x10/0x10
 legacy_get_tree+0xf0/0x190
 ? __pfx_ocfs2_mount+0x10/0x10
 vfs_get_tree+0x92/0x2b0
 do_new_mount+0x2be/0xb40
 ? __pfx_do_new_mount+0x10/0x10
 __se_sys_mount+0x2d6/0x3c0
 ? __pfx___se_sys_mount+0x10/0x10
 ? do_syscall_64+0x100/0x230
 ? __x64_sys_mount+0x20/0xc0
 do_syscall_64+0xf3/0x230
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f37cae96fda
Code: 48 8b 0d 51 ce 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e ce 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007fff6c1aa228 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00007fff6c1aa240 RCX: 00007f37cae96fda
RDX: 00000000200002c0 RSI: 0000000020000040 RDI: 00007fff6c1aa240
RBP: 0000000000000004 R08: 00007fff6c1aa280 R09: 0000000000000000
R10: 00000000000008c0 R11: 0000000000000206 R12: 00000000000008c0
R13: 00007fff6c1aa280 R14: 0000000000000003 R15: 0000000001000000
 </TASK>

For a really damaged superblock, the value of 'i_super.s_blocksize_bits'
may exceed the maximum possible shift for an underlying 'int'.  So add an
extra check whether the aforementioned field represents the valid block
size, which is 512 bytes, 1K, 2K, or 4K.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Dmitry Antipov <[email protected]>
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=56f7cd1abe4b8e475180
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agonilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint
Ryusuke Konishi [Wed, 6 Nov 2024 16:07:33 +0000 (01:07 +0900)]
nilfs2: fix null-ptr-deref in block_dirty_buffer tracepoint

When using the "block:block_dirty_buffer" tracepoint, mark_buffer_dirty()
may cause a NULL pointer dereference, or a general protection fault when
KASAN is enabled.

This happens because, since the tracepoint was added in
mark_buffer_dirty(), it references the dev_t member bh->b_bdev->bd_dev
regardless of whether the buffer head has a pointer to a block_device
structure.

In the current implementation, nilfs_grab_buffer(), which grabs a buffer
to read (or create) a block of metadata, including b-tree node blocks,
does not set the block device, but instead does so only if the buffer is
not in the "uptodate" state for each of its caller block reading
functions.  However, if the uptodate flag is set on a folio/page, and the
buffer heads are detached from it by try_to_free_buffers(), and new buffer
heads are then attached by create_empty_buffers(), the uptodate flag may
be restored to each buffer without the block device being set to
bh->b_bdev, and mark_buffer_dirty() may be called later in that state,
resulting in the bug mentioned above.

Fix this issue by making nilfs_grab_buffer() always set the block device
of the super block structure to the buffer head, regardless of the state
of the buffer's uptodate flag.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Ubisectech Sirius <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agonilfs2: fix null-ptr-deref in block_touch_buffer tracepoint
Ryusuke Konishi [Wed, 6 Nov 2024 16:07:32 +0000 (01:07 +0900)]
nilfs2: fix null-ptr-deref in block_touch_buffer tracepoint

Patch series "nilfs2: fix null-ptr-deref bugs on block tracepoints".

This series fixes null pointer dereference bugs that occur when using
nilfs2 and two block-related tracepoints.

This patch (of 2):

It has been reported that when using "block:block_touch_buffer"
tracepoint, touch_buffer() called from __nilfs_get_folio_block() causes a
NULL pointer dereference, or a general protection fault when KASAN is
enabled.

This happens because since the tracepoint was added in touch_buffer(), it
references the dev_t member bh->b_bdev->bd_dev regardless of whether the
buffer head has a pointer to a block_device structure.  In the current
implementation, the block_device structure is set after the function
returns to the caller.

Here, touch_buffer() is used to mark the folio/page that owns the buffer
head as accessed, but the common search helper for folio/page used by the
caller function was optimized to mark the folio/page as accessed when it
was reimplemented a long time ago, eliminating the need to call
touch_buffer() here in the first place.

So this solves the issue by eliminating the touch_buffer() call itself.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5305cb830834 ("block: add block_{touch|dirty}_buffer tracepoint")
Signed-off-by: Ryusuke Konishi <[email protected]>
Reported-by: Ubisectech Sirius <[email protected]>
Closes: https://lkml.kernel.org/r/[email protected]
Reported-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=9982fb8d18eba905abe2
Tested-by: [email protected]
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agomm: page_alloc: move mlocked flag clearance into free_pages_prepare()
Roman Gushchin [Wed, 6 Nov 2024 19:53:54 +0000 (19:53 +0000)]
mm: page_alloc: move mlocked flag clearance into free_pages_prepare()

Syzbot reported a bad page state problem caused by a page being freed
using free_page() still having a mlocked flag at free_pages_prepare()
stage:

  BUG: Bad page state in process syz.5.504  pfn:61f45
  page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x61f45
  flags: 0xfff00000080204(referenced|workingset|mlocked|node=0|zone=1|lastcpupid=0x7ff)
  raw: 00fff00000080204 0000000000000000 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
  page_owner tracks the page as allocated
  page last allocated via order 0, migratetype Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 8443, tgid 8442 (syz.5.504), ts 201884660643, free_ts 201499827394
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
   prep_new_page mm/page_alloc.c:1545 [inline]
   get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457
   __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733
   alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
   kvm_coalesced_mmio_init+0x1f/0xf0 virt/kvm/coalesced_mmio.c:99
   kvm_create_vm virt/kvm/kvm_main.c:1235 [inline]
   kvm_dev_ioctl_create_vm virt/kvm/kvm_main.c:5488 [inline]
   kvm_dev_ioctl+0x12dc/0x2240 virt/kvm/kvm_main.c:5530
   __do_compat_sys_ioctl fs/ioctl.c:1007 [inline]
   __se_compat_sys_ioctl+0x510/0xc90 fs/ioctl.c:950
   do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
   __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
   do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e
  page last free pid 8399 tgid 8399 stack trace:
   reset_page_owner include/linux/page_owner.h:25 [inline]
   free_pages_prepare mm/page_alloc.c:1108 [inline]
   free_unref_folios+0xf12/0x18d0 mm/page_alloc.c:2686
   folios_put_refs+0x76c/0x860 mm/swap.c:1007
   free_pages_and_swap_cache+0x5c8/0x690 mm/swap_state.c:335
   __tlb_batch_free_encoded_pages mm/mmu_gather.c:136 [inline]
   tlb_batch_pages_flush mm/mmu_gather.c:149 [inline]
   tlb_flush_mmu_free mm/mmu_gather.c:366 [inline]
   tlb_flush_mmu+0x3a3/0x680 mm/mmu_gather.c:373
   tlb_finish_mmu+0xd4/0x200 mm/mmu_gather.c:465
   exit_mmap+0x496/0xc40 mm/mmap.c:1926
   __mmput+0x115/0x390 kernel/fork.c:1348
   exit_mm+0x220/0x310 kernel/exit.c:571
   do_exit+0x9b2/0x28e0 kernel/exit.c:926
   do_group_exit+0x207/0x2c0 kernel/exit.c:1088
   __do_sys_exit_group kernel/exit.c:1099 [inline]
   __se_sys_exit_group kernel/exit.c:1097 [inline]
   __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
   x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  Modules linked in:
  CPU: 0 UID: 0 PID: 8442 Comm: syz.5.504 Not tainted 6.12.0-rc6-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
   bad_page+0x176/0x1d0 mm/page_alloc.c:501
   free_page_is_bad mm/page_alloc.c:918 [inline]
   free_pages_prepare mm/page_alloc.c:1100 [inline]
   free_unref_page+0xed0/0xf20 mm/page_alloc.c:2638
   kvm_destroy_vm virt/kvm/kvm_main.c:1327 [inline]
   kvm_put_kvm+0xc75/0x1350 virt/kvm/kvm_main.c:1386
   kvm_vcpu_release+0x54/0x60 virt/kvm/kvm_main.c:4143
   __fput+0x23f/0x880 fs/file_table.c:431
   task_work_run+0x24f/0x310 kernel/task_work.c:239
   exit_task_work include/linux/task_work.h:43 [inline]
   do_exit+0xa2f/0x28e0 kernel/exit.c:939
   do_group_exit+0x207/0x2c0 kernel/exit.c:1088
   __do_sys_exit_group kernel/exit.c:1099 [inline]
   __se_sys_exit_group kernel/exit.c:1097 [inline]
   __ia32_sys_exit_group+0x3f/0x40 kernel/exit.c:1097
   ia32_sys_call+0x2624/0x2630 arch/x86/include/generated/asm/syscalls_32.h:253
   do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline]
   __do_fast_syscall_32+0xb4/0x110 arch/x86/entry/common.c:386
   do_fast_syscall_32+0x34/0x80 arch/x86/entry/common.c:411
   entry_SYSENTER_compat_after_hwframe+0x84/0x8e
  RIP: 0023:0xf745d579
  Code: Unable to access opcode bytes at 0xf745d54f.
  RSP: 002b:00000000f75afd6c EFLAGS: 00000206 ORIG_RAX: 00000000000000fc
  RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 00000000ffffff9c RDI: 00000000f744cff4
  RBP: 00000000f717ae61 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
  R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
   </TASK>

The problem was originally introduced by commit b109b87050df ("mm/munlock:
replace clear_page_mlock() by final clearance"): it was focused on
handling pagecache and anonymous memory and wasn't suitable for lower
level get_page()/free_page() API's used for example by KVM, as with this
reproducer.

Fix it by moving the mlocked flag clearance down to free_page_prepare().

The bug itself if fairly old and harmless (aside from generating these
warnings), aside from a small memory leak - "bad" pages are stopped from
being allocated again.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b109b87050df ("mm/munlock: replace clear_page_mlock() by final clearance")
Signed-off-by: Roman Gushchin <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]
Acked-by: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Sean Christopherson <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agonet: fix data-races around sk->sk_forward_alloc
Wang Liang [Thu, 7 Nov 2024 02:34:05 +0000 (10:34 +0800)]
net: fix data-races around sk->sk_forward_alloc

Syzkaller reported this warning:
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 16 at net/ipv4/af_inet.c:156 inet_sock_destruct+0x1c5/0x1e0
 Modules linked in:
 CPU: 0 UID: 0 PID: 16 Comm: ksoftirqd/0 Not tainted 6.12.0-rc5 #26
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
 RIP: 0010:inet_sock_destruct+0x1c5/0x1e0
 Code: 24 12 4c 89 e2 5b 48 c7 c7 98 ec bb 82 41 5c e9 d1 18 17 ff 4c 89 e6 5b 48 c7 c7 d0 ec bb 82 41 5c e9 bf 18 17 ff 0f 0b eb 83 <0f> 0b eb 97 0f 0b eb 87 0f 0b e9 68 ff ff ff 66 66 2e 0f 1f 84 00
 RSP: 0018:ffffc9000008bd90 EFLAGS: 00010206
 RAX: 0000000000000300 RBX: ffff88810b172a90 RCX: 0000000000000007
 RDX: 0000000000000002 RSI: 0000000000000300 RDI: ffff88810b172a00
 RBP: ffff88810b172a00 R08: ffff888104273c00 R09: 0000000000100007
 R10: 0000000000020000 R11: 0000000000000006 R12: ffff88810b172a00
 R13: 0000000000000004 R14: 0000000000000000 R15: ffff888237c31f78
 FS:  0000000000000000(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ffc63fecac8 CR3: 000000000342e000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  ? __warn+0x88/0x130
  ? inet_sock_destruct+0x1c5/0x1e0
  ? report_bug+0x18e/0x1a0
  ? handle_bug+0x53/0x90
  ? exc_invalid_op+0x18/0x70
  ? asm_exc_invalid_op+0x1a/0x20
  ? inet_sock_destruct+0x1c5/0x1e0
  __sk_destruct+0x2a/0x200
  rcu_do_batch+0x1aa/0x530
  ? rcu_do_batch+0x13b/0x530
  rcu_core+0x159/0x2f0
  handle_softirqs+0xd3/0x2b0
  ? __pfx_smpboot_thread_fn+0x10/0x10
  run_ksoftirqd+0x25/0x30
  smpboot_thread_fn+0xdd/0x1d0
  kthread+0xd3/0x100
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x34/0x50
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 ---[ end trace 0000000000000000 ]---

Its possible that two threads call tcp_v6_do_rcv()/sk_forward_alloc_add()
concurrently when sk->sk_state == TCP_LISTEN with sk->sk_lock unlocked,
which triggers a data-race around sk->sk_forward_alloc:
tcp_v6_rcv
    tcp_v6_do_rcv
        skb_clone_and_charge_r
            sk_rmem_schedule
                __sk_mem_schedule
                    sk_forward_alloc_add()
            skb_set_owner_r
                sk_mem_charge
                    sk_forward_alloc_add()
        __kfree_skb
            skb_release_all
                skb_release_head_state
                    sock_rfree
                        sk_mem_uncharge
                            sk_forward_alloc_add()
                            sk_mem_reclaim
                                // set local var reclaimable
                                __sk_mem_reclaim
                                    sk_forward_alloc_add()

In this syzkaller testcase, two threads call
tcp_v6_do_rcv() with skb->truesize=768, the sk_forward_alloc changes like
this:
 (cpu 1)             | (cpu 2)             | sk_forward_alloc
 ...                 | ...                 | 0
 __sk_mem_schedule() |                     | +4096 = 4096
                     | __sk_mem_schedule() | +4096 = 8192
 sk_mem_charge()     |                     | -768  = 7424
                     | sk_mem_charge()     | -768  = 6656
 ...                 |    ...              |
 sk_mem_uncharge()   |                     | +768  = 7424
 reclaimable=7424    |                     |
                     | sk_mem_uncharge()   | +768  = 8192
                     | reclaimable=8192    |
 __sk_mem_reclaim()  |                     | -4096 = 4096
                     | __sk_mem_reclaim()  | -8192 = -4096 != 0

The skb_clone_and_charge_r() should not be called in tcp_v6_do_rcv() when
sk->sk_state is TCP_LISTEN, it happens later in tcp_v6_syn_recv_sock().
Fix the same issue in dccp_v6_do_rcv().

Suggested-by: Eric Dumazet <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets")
Signed-off-by: Wang Liang <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMerge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Mon, 11 Nov 2024 22:09:57 +0000 (14:09 -0800)]
Merge tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - The fair sched class currently has a bug where its balance() returns
   true telling the sched core that it has tasks to run but then NULL
   from pick_task(). This makes sched core call sched_ext's pick_task()
   without preceding balance() which can lead to stalls in partial mode.

   For now, work around by detecting the condition and forcing the CPU
   to go through another scheduling cycle.

 - Add a missing newline to an error message and fix drgn introspection
   tool which went out of sync.

* tag 'sched_ext-for-6.12-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
  sched_ext: Update scx_show_state.py to match scx_ops_bypass_depth's new type
  sched_ext: Add a missing newline at the end of an error message

4 months agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Mon, 11 Nov 2024 17:06:17 +0000 (09:06 -0800)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio fixes from Michael Tsirkin:
 "Several small bugfixes all over the place"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa/mlx5: Fix error path during device add
  vp_vdpa: fix id_table array not null terminated error
  virtio_pci: Fix admin vq cleanup by using correct info pointer
  vDPA/ifcvf: Fix pci_read_config_byte() return code handling
  Fix typo in vringh_test.c
  vdpa: solidrun: Fix UB bug with devres
  vsock/virtio: Initialization of the dangling pointer occurring in vsk->trans

4 months agodm-cache: fix warnings about duplicate slab caches
Mikulas Patocka [Mon, 11 Nov 2024 15:51:02 +0000 (16:51 +0100)]
dm-cache: fix warnings about duplicate slab caches

The commit 4c39529663b9 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-cache
code.

The dm-cache code allocates a slab cache for each device. This commit
changes it to allocate just one slab cache in the module init function.

Signed-off-by: Mikulas Patocka <[email protected]>
Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
4 months agodm-bufio: fix warnings about duplicate slab caches
Mikulas Patocka [Mon, 11 Nov 2024 15:48:18 +0000 (16:48 +0100)]
dm-bufio: fix warnings about duplicate slab caches

The commit 4c39529663b9 adds a warning about duplicate cache names if
CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio
code. The dm-bufio code allocates a slab cache with each client. It is
not possible to preallocate the caches in the module init function
because the size of auxiliary per-buffer data is not known at this point.

So, this commit changes dm-bufio so that it appends a unique atomic value
to the cache name, to avoid the warnings.

Signed-off-by: Mikulas Patocka <[email protected]>
Fixes: 4c39529663b9 ("slab: Warn on duplicate cache names when DEBUG_VM=y")
4 months agocpufreq: intel_pstate: Rearrange locking in hybrid_init_cpu_capacity_scaling()
Rafael J. Wysocki [Thu, 7 Nov 2024 12:36:10 +0000 (13:36 +0100)]
cpufreq: intel_pstate: Rearrange locking in hybrid_init_cpu_capacity_scaling()

Notice that hybrid_init_cpu_capacity_scaling() only needs to hold
hybrid_capacity_lock around __hybrid_init_cpu_capacity_scaling()
calls, so introduce a "locked" wrapper around the latter and call
it from the former.  This allows to drop a local variable and a
label that are not needed any more.

Also, rename __hybrid_init_cpu_capacity_scaling() to
__hybrid_refresh_cpu_capacity_scaling() for consistency.

Interestingly enough, this fixes a locking issue introduced by commit
929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on
hybrid systems") that put an arch_enable_hybrid_capacity_scale() call
under hybrid_capacity_lock, which was a mistake because the latter is
acquired in CPU hotplug paths and so it cannot be held around
cpus_read_lock() calls.

Link: https://lore.kernel.org/linux-pm/SJ1PR11MB6129EDBF22F8A90FC3A3EDC8B9582@SJ1PR11MB6129.namprd11.prod.outlook.com/
Fixes: 929ebc93ccaa ("cpufreq: intel_pstate: Set asymmetric CPU capacity on hybrid systems")
Signed-off-by: Rafael J. Wysocki <[email protected]>
Reported-by: "Borah, Chaitanya Kumar" <[email protected]>
Link: https://patch.msgid.link/[email protected]
[ rjw: Changelog update ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
4 months agomm: count zeromap read and set for swapout and swapin
Barry Song [Thu, 7 Nov 2024 01:12:46 +0000 (14:12 +1300)]
mm: count zeromap read and set for swapout and swapin

When the proportion of folios from the zeromap is small, missing their
accounting may not significantly impact profiling.  However, it's easy to
construct a scenario where this becomes an issue—for example, allocating
1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT,
and then swapping it back in.  In this case, the swap-out and swap-in
counts seem to vanish into a black hole, potentially causing semantic
ambiguity.

On the other hand, Usama reported that zero-filled pages can exceed 10% in
workloads utilizing zswap, while Hailong noted that some app in Android
have more than 6% zero-filled pages.  Before commit 0ca0c24e3211 ("mm:
store zero pages to be swapped out in a bitmap"), both zswap and zRAM
implemented similar optimizations, leading to these optimized-out pages
being counted in either zswap or zRAM counters (with pswpin/pswpout also
increasing for zRAM).  With zeromap functioning prior to both zswap and
zRAM, userspace will no longer detect these swap-out and swap-in actions.

We have three ways to address this:

1. Introduce a dedicated counter specifically for the zeromap.

2. Use pswpin/pswpout accounting, treating the zero map as a standard
   backend.  This approach aligns with zRAM's current handling of
   same-page fills at the device level.  However, it would mean losing the
   optimized-out page counters previously available in zRAM and would not
   align with systems using zswap.  Additionally, as noted by Nhat Pham,
   pswpin/pswpout counters apply only to I/O done directly to the backend
   device.

3. Count zeromap pages under zswap, aligning with system behavior when
   zswap is enabled.  However, this would not be consistent with zRAM, nor
   would it align with systems lacking both zswap and zRAM.

Given the complications with options 2 and 3, this patch selects
option 1.

We can find these counters from /proc/vmstat (counters for the whole
system) and memcg's memory.stat (counters for the interested memcg).

For example:

$ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
swpin_zero 1648
swpout_zero 33536

$ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
swpin_zero 3905
swpout_zero 3985

This patch does not address any specific zeromap bug, but the missing
swpout and swpin counts for zero-filled pages can be highly confusing and
may mislead user-space agents that rely on changes in these counters as
indicators.  Therefore, we add a Fixes tag to encourage the inclusion of
this counter in any kernel versions with zeromap.

Many thanks to Kanchana for the contribution of changing
count_objcg_event() to count_objcg_events() to support large folios[1],
which has now been incorporated into this patch.

[1] https://lkml.kernel.org/r/20241001053222[email protected]

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Co-developed-by: Kanchana P Sridhar <[email protected]>
Signed-off-by: Barry Song <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Hailong Liu <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Chris Li <[email protected]>
Cc: "Huang, Ying" <[email protected]>
Cc: Kairui Song <[email protected]>
Cc: Ryan Roberts <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
4 months agobcachefs: Allow for unknown key types in backpointers fsck
Kent Overstreet [Mon, 11 Nov 2024 04:28:33 +0000 (23:28 -0500)]
bcachefs: Allow for unknown key types in backpointers fsck

We can't assume that btrees only contain keys of a given type - even if
they only have a single key type listed in the allowed key types for
that btree; this is a forwards compatibility issue.

Reported-by: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
4 months agobcachefs: Fix assertion pop in topology repair
Kent Overstreet [Mon, 11 Nov 2024 03:01:04 +0000 (22:01 -0500)]
bcachefs: Fix assertion pop in topology repair

Fixes: baefd3f849ed ("bcachefs: btree_cache.freeable list fixes")
Signed-off-by: Kent Overstreet <[email protected]>
4 months agoLinux 6.12-rc7 v6.12-rc7
Linus Torvalds [Sun, 10 Nov 2024 22:19:35 +0000 (14:19 -0800)]
Linux 6.12-rc7

4 months agoMerge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 10 Nov 2024 22:16:28 +0000 (14:16 -0800)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
 "A handful of Qualcomm clk driver fixes:

   - Correct flags for X Elite USB MP GDSC and pcie pipediv2 clocks

   - Fix alpha PLL post_div mask for the cases where width is not
     specified

   - Avoid hangs in the SM8350 video driver (venus) by setting HW_CTRL
     trigger feature on the video clocks"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: qcom: gcc-x1e80100: Fix USB MP SS1 PHY GDSC pwrsts flags
  clk: qcom: gcc-x1e80100: Fix halt_check for pipediv2 clocks
  clk: qcom: clk-alpha-pll: Fix pll post div mask when width is not set
  clk: qcom: videocc-sm8350: use HW_CTRL_TRIGGER for vcodec GDSCs

4 months agoMerge tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sun, 10 Nov 2024 22:13:05 +0000 (14:13 -0800)]
Merge tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "i2c-host fixes for v6.12-rc7 (from Andi):

   - Fix designware incorrect behavior when concluding a transmission

   - Fix Mule multiplexer error value evaluation"

* tag 'i2c-for-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: designware: do not hold SCL low when I2C_DYNAMIC_TAR_UPDATE is not set
  i2c: muxes: Fix return value check in mule_i2c_mux_probe()

4 months agofilemap: Fix bounds checking in filemap_read()
Trond Myklebust [Fri, 13 Sep 2024 17:57:04 +0000 (13:57 -0400)]
filemap: Fix bounds checking in filemap_read()

If the caller supplies an iocb->ki_pos value that is close to the
filesystem upper limit, and an iterator with a count that causes us to
overflow that limit, then filemap_read() enters an infinite loop.

This behaviour was discovered when testing xfstests generic/525 with the
"localio" optimisation for loopback NFS mounts.

Reported-by: Mike Snitzer <[email protected]>
Fixes: c2a9737f45e2 ("vfs,mm: fix a dead loop in truncate_inode_pages_range()")
Tested-by: Mike Snitzer <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
4 months agoMerge tag 'irq_urgent_for_v6.12_rc7' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 10 Nov 2024 17:37:47 +0000 (09:37 -0800)]
Merge tag 'irq_urgent_for_v6.12_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Borislav Petkov:

 - Make sure GICv3 controller interrupt activation doesn't race with a
   concurrent deactivation due to propagation delays of the register
   write

* tag 'irq_urgent_for_v6.12_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3: Force propagation of the active state with a read-back

4 months agoMerge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Sun, 10 Nov 2024 17:04:27 +0000 (09:04 -0800)]
Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "20 hotfixes, 14 of which are cc:stable.

  Three affect DAMON. Lorenzo's five-patch series to address the
  mmap_region error handling is here also.

  Apart from that, various singletons"

* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add entry for Thorsten Blum
  ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
  signal: restore the override_rlimit logic
  fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
  ucounts: fix counter leak in inc_rlimit_get_ucounts()
  selftests: hugetlb_dio: check for initial conditions to skip in the start
  mm: fix docs for the kernel parameter ``thp_anon=``
  mm/damon/core: avoid overflow in damon_feed_loop_next_input()
  mm/damon/core: handle zero schemes apply interval
  mm/damon/core: handle zero {aggregation,ops_update} intervals
  mm/mlock: set the correct prev on failure
  objpool: fix to make percpu slot allocation more robust
  mm/page_alloc: keep track of free highatomic
  mm: resolve faulty mmap_region() error path behaviour
  mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
  mm: refactor map_deny_write_exec()
  mm: unconditionally close VMAs on error
  mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  mm/thp: fix deferred split unqueue naming and locking
  mm/thp: fix deferred split queue not partially_mapped

4 months agoMerge tag 'usb-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sun, 10 Nov 2024 16:56:48 +0000 (08:56 -0800)]
Merge tag 'usb-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB/Thunderbolt fixes from Greg KH:
 "Here are some small remaining USB and Thunderbolt fixes and device ids
  for 6.12-rc7. Included in here are:

   - new USB serial driver device ids

   - thunderbolt driver fixes for reported problems

   - typec bugfixes

   - dwc3 driver fix

   - musb driver fix

  All of these have been in linux-next this past week with no reported
  issues"

* tag 'usb-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  USB: serial: qcserial: add support for Sierra Wireless EM86xx
  thunderbolt: Fix connection issue with Pluggable UD-4VPD dock
  usb: typec: fix potential out of bounds in ucsi_ccg_update_set_new_cam_cmd()
  usb: dwc3: fix fault at system suspend if device was already runtime suspended
  usb: typec: qcom-pmic: init value of hdr_len/txbuf_len earlier
  usb: musb: sunxi: Fix accessing an released usb phy
  USB: serial: io_edgeport: fix use after free in debug printk
  USB: serial: option: add Quectel RG650V
  USB: serial: option: add Fibocom FG132 0x0112 composition
  thunderbolt: Add only on-board retimers when !CONFIG_USB4_DEBUGFS_MARGINING

4 months agoMerge tag 'staging-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sun, 10 Nov 2024 16:53:24 +0000 (08:53 -0800)]
Merge tag 'staging-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver fixes from Greg KH:
 "Here are two small memory leak fixes for the vchiq_arm staging driver
  that have been sitting in my tree for weeks and should get merged for
  6.12-rc7 so that people don't keep tripping over them.

  They both have been in linux-next for a while with no reported
  problems"

* tag 'staging-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: vchiq_arm: Use devm_kzalloc() for drv_mgmt allocation
  staging: vchiq_arm: Use devm_kzalloc() for vchiq_arm_state allocation

4 months agoselftests: net: add netlink-dumps to .gitignore
Jakub Kicinski [Fri, 8 Nov 2024 00:47:31 +0000 (16:47 -0800)]
selftests: net: add netlink-dumps to .gitignore

Commit 55d42a0c3f9c ("selftests: net: add a test for closing
a netlink socket ith dump in progress") added a new test
but did not add it to gitignore.

Reviewed-by: Joe Damato <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMerge tag 'i2c-host-fixes-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel...
Wolfram Sang [Sat, 9 Nov 2024 22:47:51 +0000 (23:47 +0100)]
Merge tag 'i2c-host-fixes-6.12-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current

i2c-host fixes for v6.12-rc7

In designware an incorrect behavior has been fixes when
concluding a transmission.

Fixed return error value evaluation in the Mule multiplexer.

4 months agonet: vertexcom: mse102x: Fix tx_bytes calculation
Stefan Wahren [Fri, 8 Nov 2024 11:43:43 +0000 (12:43 +0100)]
net: vertexcom: mse102x: Fix tx_bytes calculation

The tx_bytes should consider the actual size of the Ethernet frames
without the SPI encapsulation. But we still need to take care of
Ethernet padding.

Fixes: 2f207cbf0dd4 ("net: vertexcom: Add MSE102x SPI support")
Signed-off-by: Stefan Wahren <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
4 months agoMerge tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Sat, 9 Nov 2024 21:18:07 +0000 (13:18 -0800)]
Merge tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD

* tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix READDIR on NFSv3 mounts of ext4 exports

4 months agoMerge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 9 Nov 2024 20:58:23 +0000 (12:58 -0800)]
Merge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fix from Steve French:
 "Fix net namespace refcount use after free issue"

* tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: Fix use-after-free of network namespace.

4 months agoMerge tag 'block-6.12-20241108' of git://git.kernel.dk/linux
Linus Torvalds [Sat, 9 Nov 2024 20:55:32 +0000 (12:55 -0800)]
Merge tag 'block-6.12-20241108' of git://git.kernel.dk/linux

Pull block fix from Jens Axboe:
 "Single fix for an issue triggered with PROVE_RCU=y, with nvme using
  the wrong iterators for an SRCU protected list"

* tag 'block-6.12-20241108' of git://git.kernel.dk/linux:
  nvme/host: Fix RCU list traversal to use SRCU primitive

4 months agosched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()
Tejun Heo [Sat, 9 Nov 2024 20:43:55 +0000 (10:43 -1000)]
sched_ext: Handle cases where pick_task_scx() is called without preceding balance_scx()

sched_ext dispatches tasks from the BPF scheduler from balance_scx() and
thus every pick_task_scx() call must be preceded by balance_scx(). While
this usually holds, due to a bug, there are cases where the fair class's
balance() returns true indicating that it has tasks to run on the CPU and
thus terminating balance() calls but fails to actually find the next task to
run when pick_task() is called. In such cases, pick_task_scx() can be called
without preceding balance_scx().

Detect this condition using SCX_RQ_BAL_PENDING flags. If detected, keep
running the previous task if possible and avoid stalling from entering idle
without balancing.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
This page took 0.15309 seconds and 4 git commands to generate.