====================
ena: Link IRQs, queues, and NAPI instances
This series uses the netdev-genl API to link IRQs and queues to NAPI IDs
so that this information is queryable by user apps. This is particularly
useful for epoll-based busy polling apps which rely on having access to
the NAPI ID.
I've tested these commits on an EC2 instance with an ENA NIC configured
and have included test output in the commit messages for each patch
showing how to query the information.
I noted in the implementation that the driver requests an IRQ for
management purposes which does not have an associated NAPI. I tried
to take this into account in patch 1, but would appreciate if ENA
maintainers can verify I did this correctly.
Joe Damato [Wed, 2 Oct 2024 00:13:27 +0000 (00:13 +0000)]
ena: Link IRQs to NAPI instances
Link IRQs to NAPI instances with netif_napi_set_irq. This information
can be queried with the netdev-genl API. Note that the ENA device
appears to allocate an IRQ for management purposes which does not have a
NAPI associated with it; this commit takes this into consideration to
accurately construct a map between IRQs and NAPI instances.
Compare the output of /proc/interrupts for my ena device with the output of
netdev-genl after applying this patch:
====================
packing: various improvements and KUnit tests
This series contains a handful of improvements and fixes for the packing
library, including the addition of KUnit tests.
There are two major changes which might be considered bug fixes:
1) The library is updated to handle arbitrary buffer lengths, fixing
undefined behavior when operating on buffers which are not a multiple
of 4 bytes.
2) The behavior of QUIRK_MSB_ON_THE_RIGHT is fixed to match the intended
behavior when operating on packings that are not byte aligned.
These are not sent to net because no driver currently depends on this
behavior. For (1), the existing users of the packing API all operate on
buffers which are multiples of 4-bytes. For (2), no driver currently uses
QUIRK_MSB_ON_THE_RIGHT. The incorrect behavior was found while writing
KUnit tests.
This series also includes a handful of minor cleanups from Vladimir, as
well as a change to introduce a separated pack() and unpack() API. This API
is not (yet) used by a driver, but is the first step in implementing
pack_fields() and unpack_fields() which will be used in future changes for
the ice driver and changes Vladimir has in progress for other drivers using
the packing API.
This series is part 1 of a 2-part series for implementing use of
lib/packing in the ice driver. The 2nd part includes a new pack_fields()
and unpack_fields() implementation inspired by the ice driver's existing
bit packing code. It is built on top of the split pack() and unpack()
code. Additionally, the KUnit tests are built on top of pack() and
unpack(), based on original selftests written by Vladimir.
Fitting the entire library changes and drivers changes into a single series
exceeded the usual series limits.
Jacob Keller [Wed, 2 Oct 2024 21:51:57 +0000 (14:51 -0700)]
lib: packing: fix QUIRK_MSB_ON_THE_RIGHT behavior
The QUIRK_MSB_ON_THE_RIGHT quirk is intended to modify pack() and unpack()
so that the most significant bit of each byte in the packed layout is on
the right.
The way the quirk is currently implemented is broken whenever the packing
code packs or unpacks any value that is not exactly a full byte.
The broken behavior can occur when packing any values smaller than one
byte, when packing any value that is not exactly a whole number of bytes,
or when the packing is not aligned to a byte boundary.
This quirk is documented in the following way:
1. Normally (no quirks), we would do it like this:
That is, QUIRK_MSB_ON_THE_RIGHT does not affect byte positioning, but
inverts bit offsets inside a byte.
Essentially, the mapping for physical bit offsets should be reserved for a
given byte within the payload. This reversal should be fixed to the bytes
in the packing layout.
The logic to implement this quirk is handled within the
adjust_for_msb_right_quirk() function. This function does not work properly
when dealing with the bytes that contain only a partial amount of data.
In particular, consider trying to pack or unpack the range 53-44. We should
always be mapping the bits from the logical ordering to their physical
ordering in the same way, regardless of what sequence of bits we are
unpacking.
That is, it tries to keep the bits at the start and end of a packing
together. This is wrong, as it makes the packing change what bit is being
mapped to what based on which bits you're currently packing or unpacking.
Worse, the actual calculations within adjust_for_msb_right_quirk don't make
sense.
Consider the case when packing the last byte of an unaligned packing. It
might have a start bit of 7 and an end bit of 5. This would have a width of
3 bits. The new_start_bit will be calculated as the width - the box_end_bit
- 1. This will underflow and produce a negative value, which will
ultimate result in generating a new box_mask of all 0s.
For any other values, the result of the calculations of the
new_box_end_bit, new_box_start_bit, and the new box_mask will result in the
exact same values for the box_end_bit, box_start_bit, and box_mask. This
makes the calculations completely irrelevant.
If box_end_bit is 0, and box_start_bit is 7, then the entire function of
adjust_for_msb_right_quirk will boil down to just:
*to_write = bitrev8(*to_write)
The other adjustments are attempting (incorrectly) to keep the bits in the
same place but just reversed. This is not the right behavior even if
implemented correctly, as it leaves the mapping dependent on the bit values
being packed or unpacked.
Remove adjust_for_msb_right_quirk() and just use bitrev8 to reverse the
byte order when interacting with the packed data.
In particular, for packing, we need to reverse both the box_mask and the
physical value being packed. This is done after shifting the value by
box_end_bit so that the reversed mapping is always aligned to the physical
buffer byte boundary. The box_mask is reversed as we're about to use it to
clear any stale bits in the physical buffer at this block.
For unpacking, we need to reverse the contents of the physical buffer
*before* masking with the box_mask. This is critical, as the box_mask is a
logical mask of the bit layout before handling the QUIRK_MSB_ON_THE_RIGHT.
Add several new tests which cover this behavior. These tests will fail
without the fix and pass afterwards. Note that no current drivers make use
of QUIRK_MSB_ON_THE_RIGHT. I suspect this is why there have been no reports
of this inconsistency before.
Jacob Keller [Wed, 2 Oct 2024 21:51:56 +0000 (14:51 -0700)]
lib: packing: add additional KUnit tests
While reviewing the initial KUnit tests for lib/packing, Przemek pointed
out that the test values have duplicate bytes in the input sequence.
In addition, I noticed that the unit tests pack and unpack on a byte
boundary, instead of crossing bytes. Thus, we lack good coverage of the
corner cases of the API.
Add additional unit tests to cover packing and unpacking byte buffers which
do not have duplicate bytes in the unpacked value, and which pack and
unpack to an unaligned offset.
A careful reviewer may note the lack tests for QUIRK_MSB_ON_THE_RIGHT. This
is because I found issues with that quirk during test implementation. This
quirk will be fixed and the tests will be included in a future change.
Jacob Keller [Wed, 2 Oct 2024 21:51:55 +0000 (14:51 -0700)]
lib: packing: add KUnit tests adapted from selftests
Add 24 simple KUnit tests for the lib/packing.c pack() and unpack() APIs.
The first 16 tests exercise all combinations of quirks with a simple magic
number value on a 16-byte buffer. The remaining 8 tests cover
non-multiple-of-4 buffer sizes.
These tests were originally written by Vladimir as simple selftest
functions. I adapted them to KUnit, refactoring them into a table driven
approach. This will aid in adding additional tests in the future.
Vladimir Oltean [Wed, 2 Oct 2024 21:51:54 +0000 (14:51 -0700)]
lib: packing: duplicate pack() and unpack() implementations
packing() is now used in some hot paths, and it would be good to get rid
of some ifs and buts that depend on "op", to speed things up a little bit.
With the main implementations now taking size_t endbit, we no longer
have to check for negative values. Update the local integer variables to
also be size_t to match.
Vladimir Oltean [Wed, 2 Oct 2024 21:51:53 +0000 (14:51 -0700)]
lib: packing: add pack() and unpack() wrappers over packing()
Geert Uytterhoeven described packing() as "really bad API" because of
not being able to enforce const correctness. The same function is used
both when "pbuf" is input and "uval" is output, as in the other way
around.
Create 2 wrapper functions where const correctness can be ensured.
Do ugly type casts inside, to be able to reuse packing() as currently
implemented - which will _not_ modify the input argument.
Also, take the opportunity to change the type of startbit and endbit to
size_t - an unsigned type - in these new function prototypes. When int,
an extra check for negative values is necessary. Hopefully, when
packing() goes away completely, that check can be dropped.
My concern is that code which does rely on the conditional directionality
of packing() is harder to refactor without blowing up in size. So it may
take a while to completely eliminate packing(). But let's make alternatives
available for those who do not need that.
Vladimir Oltean [Wed, 2 Oct 2024 21:51:52 +0000 (14:51 -0700)]
lib: packing: remove kernel-doc from header file
It is not necessary to have the kernel-doc duplicated both in the
header and in the implementation. It is better to have it near the
implementation of the function, since in C, a function can have N
declarations, but only one definition.
Vladimir Oltean [Wed, 2 Oct 2024 21:51:51 +0000 (14:51 -0700)]
lib: packing: adjust definitions and implementation for arbitrary buffer lengths
Jacob Keller has a use case for packing() in the intel/ice networking
driver, but it cannot be used as-is.
Simply put, the API quirks for LSW32_IS_FIRST and LITTLE_ENDIAN are
naively implemented with the undocumented assumption that the buffer
length must be a multiple of 4. All calculations of group offsets and
offsets of bytes within groups assume that this is the case. But in the
ice case, this does not hold true. For example, packing into a buffer
of 22 bytes would yield wrong results, but pretending it was a 24 byte
buffer would work.
Rather than requiring such hacks, and leaving a big question mark when
it comes to discontinuities in the accessible bit fields of such buffer,
we should extend the packing API to support this use case.
It turns out that we can keep the design in terms of groups of 4 bytes,
but also make it work if the total length is not a multiple of 4.
Just like before, imagine the buffer as a big number, and its most
significant bytes (the ones that would make up to a multiple of 4) are
missing. Thus, with a big endian (no quirks) interpretation of the
buffer, those most significant bytes would be absent from the beginning
of the buffer, and with a LSW32_IS_FIRST interpretation, they would be
absent from the end of the buffer. The LITTLE_ENDIAN quirk, in the
packing() API world, only affects byte ordering within groups of 4.
Thus, it does not change which bytes are missing. Only the significance
of the remaining bytes within the (smaller) group.
No change intended for buffer sizes which are multiples of 4. Tested
with the sja1105 driver and with downstream unit tests.
Vladimir Oltean [Wed, 2 Oct 2024 21:51:50 +0000 (14:51 -0700)]
lib: packing: refuse operating on bit indices which exceed size of buffer
While reworking the implementation, it became apparent that this check
does not exist.
There is no functional issue yet, because at call sites, "startbit" and
"endbit" are always hardcoded to correct values, and never come from the
user.
Even with the upcoming support of arbitrary buffer lengths, the
"startbit >= 8 * pbuflen" check will remain correct. This is because
we intend to always interpret the packed buffer in a way that avoids
discontinuities in the available bit indices.
- sctp: set sk_state back to CLOSED if autobind fails in
sctp_listen_start
- mac802154: fix potential RCU dereference issue in
mac802154_scan_worker
- eth: fec: restart PPS after link state change"
* tag 'net-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (48 commits)
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
dt-bindings: net: xlnx,axi-ethernet: Add missing reg minItems
doc: net: napi: Update documentation for napi_schedule_irqoff
net/ncsi: Disable the ncsi work before freeing the associated structure
net: phy: qt2025: Fix warning: unused import DeviceId
gso: fix udp gso fraglist segmentation after pull from frag_list
bridge: mcast: Fail MDB get request on empty entry
vrf: revert "vrf: Remove unnecessary RCU-bh critical section"
net: ethernet: ti: am65-cpsw: Fix forever loop in cleanup code
net: phy: realtek: Check the index value in led_hw_control_get
ppp: do not assume bh is held in ppp_channel_bridge_input()
selftests: rds: move include.sh to TEST_FILES
net: test for not too small csum_start in virtio_net_hdr_to_skb()
net: gso: fix tcp fraglist segmentation after pull from frag_list
ipv4: ip_gre: Fix drops of small packets in ipgre_xmit
net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check
net: add more sanity checks to qdisc_pkt_len_init()
net: avoid potential underflow in qdisc_pkt_len_init() with UFO
net: ethernet: ti: cpsw_ale: Fix warning on some platforms
net: microchip: Make FDMA config symbol invisible
...
Linus Torvalds [Thu, 3 Oct 2024 16:38:16 +0000 (09:38 -0700)]
Merge tag 'v6.12-rc1-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- small cleanup patches leveraging struct size to improve access bounds checking
* tag 'v6.12-rc1-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: Use struct_size() to improve smb_direct_rdma_xmit()
ksmbd: Annotate struct copychunk_ioctl_req with __counted_by_le()
ksmbd: Use struct_size() to improve get_file_alternate_info()
Linus Torvalds [Thu, 3 Oct 2024 16:22:50 +0000 (09:22 -0700)]
Merge tag 'vfs-6.12-rc2.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"vfs:
- Ensure that iter_folioq_get_pages() advances to the next slot
otherwise it will end up using the same folio with an out-of-bound
offset.
iomap:
- Dont unshare delalloc extents which can't be reflinked, and thus
can't be shared.
- Constrain the file range passed to iomap_file_unshare() directly in
iomap instead of requiring the callers to do it.
netfs:
- Use folioq_count instead of folioq_nr_slot to prevent an
unitialized value warning in netfs_clear_buffer().
- Fix missing wakeup after issuing writes by scheduling the write
collector only if all the subrequest queues are empty and thus no
writes are pending.
- Fix two minor documentation bugs"
* tag 'vfs-6.12-rc2.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: constrain the file range passed to iomap_file_unshare
iomap: don't bother unsharing delalloc extents
netfs: Fix missing wakeup after issuing writes
Documentation: add missing folio_queue entry
folio_queue: fix documentation
netfs: Fix a KMSAN uninit-value error in netfs_clear_buffer
iov_iter: fix advancing slot in iter_folioq_get_pages()
net: mana: Add get_link and get_link_ksettings in ethtool
Add support for the ethtool get_link and get_link_ksettings
operations. Display standard port information using ethtool.
Before the change:
$ethtool enP30832s1
> No data available
After the change:
$ethtool enP30832s1
> Settings for enP30832s1:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: Unknown!
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes
Xin Long [Mon, 30 Sep 2024 20:49:51 +0000 (16:49 -0400)]
sctp: set sk_state back to CLOSED if autobind fails in sctp_listen_start
In sctp_listen_start() invoked by sctp_inet_listen(), it should set the
sk_state back to CLOSED if sctp_autobind() fails due to whatever reason.
Otherwise, next time when calling sctp_inet_listen(), if sctp_sk(sk)->reuse
is already set via setsockopt(SCTP_REUSE_PORT), sctp_sk(sk)->bind_hash will
be dereferenced as sk_state is LISTENING, which causes a crash as bind_hash
is NULL.
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
RIP: 0010:sctp_inet_listen+0x7f0/0xa20 net/sctp/socket.c:8617
Call Trace:
<TASK>
__sys_listen_socket net/socket.c:1883 [inline]
__sys_listen+0x1b7/0x230 net/socket.c:1894
__do_sys_listen net/socket.c:1902 [inline]
Add missing reg minItems as based on current binding document
only ethernet MAC IO space is a supported configuration.
There is a bug in schema, current examples contain 64-bit
addressing as well as 32-bit addressing. The schema validation
does pass incidentally considering one 64-bit reg address as
two 32-bit reg address entries. If we change axi_ethernet_eth1
example node reg addressing to 32-bit schema validation reports:
Documentation/devicetree/bindings/net/xlnx,axi-ethernet.example.dtb:
ethernet@40000000: reg: [[1073741824, 262144]] is too short
To fix it add missing reg minItems constraints and to make things clearer
stick to 32-bit addressing in examples.
Sean Anderson [Mon, 30 Sep 2024 15:39:54 +0000 (11:39 -0400)]
doc: net: napi: Update documentation for napi_schedule_irqoff
Since commit 8380c81d5c4f ("net: Treat __napi_schedule_irqoff() as
__napi_schedule() on PREEMPT_RT"), napi_schedule_irqoff will do the
right thing if IRQs are threaded. Therefore, there is no need to use
IRQF_NO_THREAD.
net: mana: Increase the DEF_RX_BUFFERS_PER_QUEUE to 1024
Through some experiments, we found out that increasing the default
RX buffers count from 512 to 1024, gives slightly better throughput
and significantly reduces the no_wqe_rx errs on the receiver side.
Along with these, other parameters like cpu usage, retrans seg etc
also show some improvement with 1024 value.
Darrick J. Wong [Wed, 2 Oct 2024 15:02:13 +0000 (08:02 -0700)]
iomap: constrain the file range passed to iomap_file_unshare
File contents can only be shared (i.e. reflinked) below EOF, so it makes
no sense to try to unshare ranges beyond EOF. Constrain the file range
parameters here so that we don't have to do that in the callers.
Darrick J. Wong [Wed, 2 Oct 2024 15:00:40 +0000 (08:00 -0700)]
iomap: don't bother unsharing delalloc extents
If unshare encounters a delalloc reservation in the srcmap, that means
that the file range isn't shared because delalloc reservations cannot be
reflinked. Therefore, don't try to unshare them.
Fix the following warning when the driver is compiled as built-in:
warning: unused import: `DeviceId`
--> drivers/net/phy/qt2025.rs:18:5
|
18 | DeviceId, Driver,
| ^^^^^^^^
|
= note: `#[warn(unused_imports)]` on by default
device_table in module_phy_driver macro is defined only when the
driver is built as a module. Use phy::DeviceId in the macro instead of
importing `DeviceId` since `phy` is always used.
First, sorry for the bland series subject - this is the first in a
number of cleanup series to the XPCS driver. This series has some
functional changes beyond merely cleanups, notably the first patch.
This series starts off with a patch that moves the PCS reset from
the xpcs_create*() family of calls to when phylink first configures
the PHY. The motivation for this change is to get rid of the
interface argument to the xpcs_create*() functions, which I see as
unnecessary complexity. This patch should be tested on Wangxun
and STMMAC drivers.
Patch 2 removes the now unnecessary interface argument from the
internal xpcs_create() and xpcs_init_iface() functions. With this,
xpcs_init_iface() becomes a misnamed function, but patch 3 removes
this function, moving its now meager contents to xpcs_create().
Patch 4 adds xpcs_destroy_pcs() and xpcs_create_pcs_mdiodev()
functions which return and take a phylink_pcs, allowing SJA1105
and Wangxun drivers to be converted to using the phylink_pcs
structure internally.
Patches 5 through 8 convert both these drivers to that end.
Patch 9 drops the interface argument from the remaining xpcs_create*()
functions, addressing the only remaining caller of these functions,
that being the STMMAC driver.
As patch 7 removed the direct calls to the XPCS config/link-up
functions, the last patch makes these functions static.
====================
net: pcs: xpcs: drop interface argument from xpcs_create*()
The XPCS sub-driver no longer uses the "interface" argument to the
xpcs_create_mdiodev() and xpcs_create_fwnode() functions. Remove
this now unnecessary argument, updating the stmmac driver
appropriately.
Use xpcs_create_pcs_mdiodev() to create the XPCS instance, storing
and using the phylink_pcs pointer internally, rather than dw_xpcs.
Use xpcs_destroy_pcs() to destroy the XPCS instance when we've
finished with it.
The static configuration reload saves the port speed in the static
configuration tables by first converting it from the internal
respresentation to the SPEED_xxx ethtool representation, and then
converts it back to restore the setting. This is because
sja1105_adjust_port_config() takes the speed as SPEED_xxx.
However, this is unnecessarily complex. If we split
sja1105_adjust_port_config() up, we can simply save and restore the
mac[port].speed member in the static configuration tables.
Use xpcs_create_pcs_mdiodev() to create the XPCS instance, storing
and using the phylink_pcs pointer internally, rather than dw_xpcs.
Use xpcs_destroy_pcs() to destroy the XPCS instance when we've
finished with it.
net: pcs: xpcs: add xpcs_destroy_pcs() and xpcs_create_pcs_mdiodev()
Provide xpcs create/destroy functions that return and take a phylink_pcs
pointer instead of an xpcs pointer. This will be used by drivers that
have been converted to use phylink_pcs pointers internally, rather than
dw_xpcs pointers.
As xpcs_create_mdiodev() no longer makes use of its interface argument,
pass PHY_INTERFACE_MODE_NA into xpcs_create_mdiodev() until it is
removed later in the series.
xpcs_init_iface() no longer does anything with the interface mode, and
now merely does configuration related to the PMA ID. Move this back
into xpcs_create() as it doesn't warrant being a separate function
anymore.
net: pcs: xpcs: move PCS reset to .pcs_pre_config()
Move the PCS reset to .pcs_pre_config() rather than at creation time,
which means we call the reset function with the interface that we're
actually going to be using to talk to the downstream device.
gso: fix udp gso fraglist segmentation after pull from frag_list
Detect gso fraglist skbs with corrupted geometry (see below) and
pass these to skb_segment instead of skb_segment_list, as the first
can segment them correctly.
Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size
Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify these skbs, breaking these invariants.
In extreme cases they pull all data into skb linear. For UDP, this
causes a NULL ptr deref in __udpv4_gso_segment_list_csum at
udp_hdr(seg->next)->dest.
Detect invalid geometry due to pull, by checking head_skb size.
Don't just drop, as this may blackhole a destination. Convert to be
able to pass to regular skb_segment.
Daniel Golle [Tue, 1 Oct 2024 00:17:18 +0000 (01:17 +0100)]
net: phy: mxl-gpy: add basic LED support
Add basic support for LEDs connected to MaxLinear GPY2xx and GPY115 PHYs.
The PHYs allow up to 4 LEDs to be connected.
Implement controlling LEDs in software as well as netdev trigger offloading
and LED polarity setup.
The hardware claims to support 16 PWM brightness levels but there is no
documentation on how to use that feature, hence this is not supported.
bridge: mcast: Fail MDB get request on empty entry
When user space deletes a port from an MDB entry, the port is removed
synchronously. If this was the last port in the entry and the entry is
not joined by the host itself, then the entry is scheduled for deletion
via a timer.
The above means that it is possible for the MDB get netlink request to
retrieve an empty entry which is scheduled for deletion. This is
problematic as after deleting the last port in an entry, user space
cannot rely on a non-zero return code from the MDB get request as an
indication that the port was successfully removed.
Fix by returning an error when the entry's port list is empty and the
entry is not joined by the host.
Hui Wang [Fri, 27 Sep 2024 11:46:10 +0000 (19:46 +0800)]
net: phy: realtek: Check the index value in led_hw_control_get
Just like rtl8211f_led_hw_is_supported() and
rtl8211f_led_hw_control_set(), the rtl8211f_led_hw_control_get() also
needs to check the index value, otherwise the caller is likely to get
an incorrect rules.
Eric Dumazet [Fri, 27 Sep 2024 07:45:53 +0000 (07:45 +0000)]
ppp: do not assume bh is held in ppp_channel_bridge_input()
Networking receive path is usually handled from BH handler.
However, some protocols need to acquire the socket lock, and
packets might be stored in the socket backlog is the socket was
owned by a user process.
In this case, release_sock(), __release_sock(), and sk_backlog_rcv()
might call the sk->sk_backlog_rcv() handler in process context.
sybot caught ppp was not considering this case in
ppp_channel_bridge_input() :
WARNING: inconsistent lock state 6.11.0-rc7-syzkaller-g5f5673607153 #0 Not tainted
--------------------------------
inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
ksoftirqd/1/24 [HC0[0]:SC1[1]:HE1:SE0] takes: ffff0000db7f11e0 (&pch->downl){+.?.}-{2:2}, at: spin_lock include/linux/spinlock.h:351 [inline] ffff0000db7f11e0 (&pch->downl){+.?.}-{2:2}, at: ppp_channel_bridge_input drivers/net/ppp/ppp_generic.c:2272 [inline] ffff0000db7f11e0 (&pch->downl){+.?.}-{2:2}, at: ppp_input+0x16c/0x854 drivers/net/ppp/ppp_generic.c:2304
{SOFTIRQ-ON-W} state was registered at:
lock_acquire+0x240/0x728 kernel/locking/lockdep.c:5759
__raw_spin_lock include/linux/spinlock_api_smp.h:133 [inline]
_raw_spin_lock+0x48/0x60 kernel/locking/spinlock.c:154
spin_lock include/linux/spinlock.h:351 [inline]
ppp_channel_bridge_input drivers/net/ppp/ppp_generic.c:2272 [inline]
ppp_input+0x16c/0x854 drivers/net/ppp/ppp_generic.c:2304
pppoe_rcv_core+0xfc/0x314 drivers/net/ppp/pppoe.c:379
sk_backlog_rcv include/net/sock.h:1111 [inline]
__release_sock+0x1a8/0x3d8 net/core/sock.c:3004
release_sock+0x68/0x1b8 net/core/sock.c:3558
pppoe_sendmsg+0xc8/0x5d8 drivers/net/ppp/pppoe.c:903
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
__sys_sendto+0x374/0x4f4 net/socket.c:2204
__do_sys_sendto net/socket.c:2216 [inline]
__se_sys_sendto net/socket.c:2212 [inline]
__arm64_sys_sendto+0xd8/0xf8 net/socket.c:2212
__invoke_syscall arch/arm64/kernel/syscall.c:35 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151
el0_svc+0x54/0x168 arch/arm64/kernel/entry-common.c:712
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:730
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598
irq event stamp: 282914
hardirqs last enabled at (282914): [<ffff80008b42e30c>] __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:151 [inline]
hardirqs last enabled at (282914): [<ffff80008b42e30c>] _raw_spin_unlock_irqrestore+0x38/0x98 kernel/locking/spinlock.c:194
hardirqs last disabled at (282913): [<ffff80008b42e13c>] __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
hardirqs last disabled at (282913): [<ffff80008b42e13c>] _raw_spin_lock_irqsave+0x2c/0x7c kernel/locking/spinlock.c:162
softirqs last enabled at (282904): [<ffff8000801f8e88>] softirq_handle_end kernel/softirq.c:400 [inline]
softirqs last enabled at (282904): [<ffff8000801f8e88>] handle_softirqs+0xa3c/0xbfc kernel/softirq.c:582
softirqs last disabled at (282909): [<ffff8000801fbdf8>] run_ksoftirqd+0x70/0x158 kernel/softirq.c:928
other info that might help us debug this:
Possible unsafe locking scenario:
Hangbin Liu [Fri, 27 Sep 2024 04:13:49 +0000 (12:13 +0800)]
selftests: rds: move include.sh to TEST_FILES
The include.sh file is generated for inclusion and should not be executable.
Otherwise, it will be added to kselftest-list.txt. Additionally, add the
executable bit for test.py at the same time to ensure proper functionality.
Eric Dumazet [Thu, 26 Sep 2024 16:58:36 +0000 (16:58 +0000)]
net: test for not too small csum_start in virtio_net_hdr_to_skb()
syzbot was able to trigger this warning [1], after injecting a
malicious packet through af_packet, setting skb->csum_start and thus
the transport header to an incorrect value.
We can at least make sure the transport header is after
the end of the network header (with a estimated minimal size).
Felix Fietkau [Thu, 26 Sep 2024 08:53:14 +0000 (10:53 +0200)]
net: gso: fix tcp fraglist segmentation after pull from frag_list
Detect tcp gso fraglist skbs with corrupted geometry (see below) and
pass these to skb_segment instead of skb_segment_list, as the first
can segment them correctly.
Valid SKB_GSO_FRAGLIST skbs
- consist of two or more segments
- the head_skb holds the protocol headers plus first gso_size
- one or more frag_list skbs hold exactly one segment
- all but the last must be gso_size
Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can
modify these skbs, breaking these invariants.
In extreme cases they pull all data into skb linear. For TCP, this
causes a NULL ptr deref in __tcpv4_gso_segment_list_csum at
tcp_hdr(seg->next).
Detect invalid geometry due to pull, by checking head_skb size.
Don't just drop, as this may blackhole a destination. Convert to be
able to pass to regular skb_segment.
Approach and description based on a patch by Willem de Bruijn.
Jakub Kicinski [Thu, 3 Oct 2024 00:14:52 +0000 (17:14 -0700)]
Merge tag 'mlx5-fixes-2024-09-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2024-09-25
* tag 'mlx5-fixes-2024-09-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Fix crash caused by calling __xfrm_state_delete() twice
net/mlx5e: SHAMPO, Fix overflow of hd_per_wq
net/mlx5: HWS, changed E2BIG error to a negative return code
net/mlx5: HWS, fixed double-free in error flow of creating SQ
net/mlx5: Fix wrong reserved field in hca_cap_2 in mlx5_ifc
net/mlx5e: Fix NULL deref in mlx5e_tir_builder_alloc()
net/mlx5: Added cond_resched() to crdump collection
net/mlx5: Fix error path in multi-packet WQE transmit
====================
Jakub Kicinski [Thu, 3 Oct 2024 00:09:52 +0000 (17:09 -0700)]
Merge tag 'for-net-2024-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- btmrvl: Use IRQF_NO_AUTOEN flag in request_irq()
- MGMT: Fix possible crash on mgmt_index_removed
- L2CAP: Fix uaf in l2cap_connect
- Bluetooth: hci_event: Align BR/EDR JUST_WORKS paring with LE
* tag 'for-net-2024-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: hci_event: Align BR/EDR JUST_WORKS paring with LE
Bluetooth: btmrvl: Use IRQF_NO_AUTOEN flag in request_irq()
Bluetooth: L2CAP: Fix uaf in l2cap_connect
Bluetooth: MGMT: Fix possible crash on mgmt_index_removed
====================
Jakub Kicinski [Thu, 3 Oct 2024 00:07:00 +0000 (17:07 -0700)]
Merge tag 'ieee802154-for-net-2024-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan
Stefan Schmidt says:
====================
pull-request: ieee802154 for net 2024-09-27
Jinjie Ruan added the use of IRQF_NO_AUTOEN in the mcr20a driver and fixed
and addiotinal build dependency problem while doing so.
Jiawei Ye, ensured a correct RCU handling in mac802154_scan_worker.
* tag 'ieee802154-for-net-2024-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wpan/wpan:
net: ieee802154: mcr20a: Use IRQF_NO_AUTOEN flag in request_irq()
mac802154: Fix potential RCU dereference issue in mac802154_scan_worker
ieee802154: Fix build error
====================
Linus Torvalds [Wed, 2 Oct 2024 23:42:28 +0000 (16:42 -0700)]
Merge tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull generic unaligned.h cleanups from Al Viro:
"Get rid of architecture-specific <asm/unaligned.h> includes, replacing
them with a single generic <linux/unaligned.h> header file.
It's the second largest (after asm/io.h) class of asm/* includes, and
all but two architectures actually end up using exact same file.
Massage the remaining two (arc and parisc) to do the same and just
move the thing to from asm-generic/unaligned.h to linux/unaligned.h"
[ This is one of those things that we're better off doing outside the
merge window, and would only cause extra conflict noise if it was in
linux-next for the next release due to all the trivial #include line
updates. Rip off the band-aid. - Linus ]
* tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
move asm/unaligned.h to linux/unaligned.h
arc: get rid of private asm/unaligned.h
parisc: get rid of private asm/unaligned.h
Al Viro [Tue, 1 Oct 2024 19:35:57 +0000 (15:35 -0400)]
move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
Al Viro [Wed, 6 Dec 2023 02:53:22 +0000 (21:53 -0500)]
arc: get rid of private asm/unaligned.h
Declarations local to arch/*/kernel/*.c are better off *not* in a public
header - arch/arc/kernel/unaligned.h is just fine for those
bits.
Unlike the parisc case, here we have an extra twist - asm/mmu.h
has an implicit dependency on struct pt_regs, and in some users
that used to be satisfied by include of asm/ptrace.h from
asm/unaligned.h (note that asm/mmu.h itself did _not_ pull asm/unaligned.h
- it relied upon the users having pulled asm/unaligned.h before asm/mmu.h
got there).
Seeing that asm/mmu.h only wants struct pt_regs * arguments in
an extern, just pre-declare it there - less brittle that way.
With that done _all_ asm/unaligned.h instances are reduced to include
of asm-generic/unaligned.h and can be removed - unaligned.h is in
mandatory-y in include/asm-generic/Kbuild.
What's more, we can move asm-generic/unaligned.h to linux/unaligned.h
and switch includes of <asm/unaligned.h> to <linux/unaligned.h>; that's
better off as an auto-generated commit, though, to be done by Linus
at -rc1 time next cycle.
Linus Torvalds [Wed, 2 Oct 2024 19:18:02 +0000 (12:18 -0700)]
Merge tag 'input-for-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- a couple fixups for adp5589-keys driver
- recently added driver for PixArt PS/2 touchpads is dropped
temporarily because its detection routine is too greedy and
mis-identifies devices from other vendors as PixArt devices
* tag 'input-for-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: adp5589-keys - fix adp5589_gpio_get_value()
Input: adp5589-keys - fix NULL pointer dereference
Revert "Input: Add driver for PixArt PS/2 touchpad"
Linus Torvalds [Wed, 2 Oct 2024 19:05:13 +0000 (12:05 -0700)]
Merge tag 'for-6.12/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mikulas Patocka:
"Revert the patch that made dm-verity restart or panic on I/O errors,
and instead add new explicit options for people who want that
behavior"
* tag 'for-6.12/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-verity: introduce the options restart_on_error and panic_on_error
Revert: "dm-verity: restart or panic on an I/O error"
David Howells [Wed, 2 Oct 2024 14:45:50 +0000 (15:45 +0100)]
netfs: Fix missing wakeup after issuing writes
After dividing up a proposed write into subrequests, netfslib sets
NETFS_RREQ_ALL_QUEUED to indicate to the collector that it can move on to
the final cleanup once it has emptied the subrequest queues.
Now, whilst the collector will normally end up running at least once after
this bit is set just because it takes a while to process all the write
subrequests before the collector runs out of subrequests, there exists the
possibility that the issuing thread will be forced to sleep and the
collector thread will clean up all the subrequests before ALL_QUEUED gets
set.
In such a case, the collector thread will not get triggered again and will
never clear NETFS_RREQ_IN_PROGRESS thus leaving a request uncompleted and
causing a potential futute hang.
Fix this by scheduling the write collector if all the subrequest queues are
empty (and thus no writes pending issuance).
Note that we'd do this ideally before queuing the subrequest, but in the
case of buffered writeback, at least, we can't find out that we've run out
of folios until after we've called writeback_iter() and it has returned
NULL - at which point we might not actually have any subrequests still
under construction.
The problem that the commit e6a3531dd542cb127c8de32ab1e54a48ae19962b
fixes was reported as a security bug, but Google engineers working on
Android and ChromeOS didn't want to change the default behavior, they
want to get -EIO rather than restarting the system, so I am reverting
that commit.
Note also that calling machine_restart from the I/O handling code is
potentially unsafe (the reboot notifiers may wait for the bio that
triggered the restart), but Android uses the reboot notifiers to store
the reboot reason into the PMU microcontroller, so machine_restart must
be used.
Al Viro [Wed, 6 Dec 2023 02:53:22 +0000 (21:53 -0500)]
parisc: get rid of private asm/unaligned.h
Declarations local to arch/*/kernel/*.c are better off *not* in a public
header - arch/parisc/kernel/unaligned.h is just fine for those
bits.
With that done parisc asm/unaligned.h is reduced to include
of asm-generic/unaligned.h and can be removed - unaligned.h is in
mandatory-y in include/asm-generic/Kbuild.
ksmbd: Annotate struct copychunk_ioctl_req with __counted_by_le()
Add the __counted_by_le compiler attribute to the flexible array member
Chunks to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Change the data type of the flexible array member Chunks from __u8[] to
struct srv_copychunk[] for ChunkCount to match the number of elements in
the Chunks array. (With __u8[], each srv_copychunk would occupy 24 array
entries and the __counted_by compiler attribute wouldn't be applicable.)
Use struct_size() to calculate the size of the copychunk_ioctl_req.
Read Chunks[0] after checking that ChunkCount is not 0.
The adp5589 seems to have the same behavior as similar devices as
explained in commit 910a9f5636f5 ("Input: adp5588-keys - get value from
data out when dir is out").
Basically, when the gpio is set as output we need to get the value from
ADP5589_GPO_DATA_OUT_A register instead of ADP5589_GPI_STATUS_A.
We register a devm action to call adp5589_clear_config() and then pass
the i2c client as argument so that we can call i2c_get_clientdata() in
order to get our device object. However, i2c_set_clientdata() is only
being set at the end of the probe function which means that we'll get a
NULL pointer dereference in case the probe function fails early.
Anton Danilov [Tue, 24 Sep 2024 23:51:59 +0000 (02:51 +0300)]
ipv4: ip_gre: Fix drops of small packets in ipgre_xmit
Regression Description:
Depending on the options specified for the GRE tunnel device, small
packets may be dropped. This occurs because the pskb_network_may_pull
function fails due to the packet's insufficient length.
For example, if only the okey option is specified for the tunnel device,
original (before encapsulation) packets smaller than 28 bytes (including
the IPv4 header) will be dropped. This happens because the required
length is calculated relative to the network header, not the skb->head.
Here is how the required length is computed and checked:
* The pull_len variable is set to 28 bytes, consisting of:
* IPv4 header: 20 bytes
* GRE header with Key field: 8 bytes
* The pskb_network_may_pull function adds the network offset, shifting
the checkable space further to the beginning of the network header and
extending it to the beginning of the packet. As a result, the end of
the checkable space occurs beyond the actual end of the packet.
Instead of ensuring that 28 bytes are present in skb->head, the function
is requesting these 28 bytes starting from the network header. For small
packets, this requested length exceeds the actual packet size, causing
the check to fail and the packets to be dropped.
This issue affects both locally originated and forwarded packets in
DMVPN-like setups.
How to reproduce (for local originated packets):
ip link add dev gre1 type gre ikey 1.9.8.4 okey 1.9.8.4 \
local <your-ip> remote 0.0.0.0
ip link set mtu 1400 dev gre1
ip link set up dev gre1
ip address add 192.168.13.1/24 dev gre1
ip neighbor add 192.168.13.2 lladdr <remote-ip> dev gre1
ping -s 1374 -c 10 192.168.13.2
tcpdump -vni gre1
tcpdump -vni <your-ext-iface> 'ip proto 47'
ip -s -s -d link show dev gre1
Solution:
Use the pskb_may_pull function instead the pskb_network_may_pull.
Shenwei Wang [Tue, 24 Sep 2024 20:54:24 +0000 (15:54 -0500)]
net: stmmac: dwmac4: extend timeout for VLAN Tag register busy bit check
Increase the timeout for checking the busy bit of the VLAN Tag register
from 10µs to 500ms. This change is necessary to accommodate scenarios
where Energy Efficient Ethernet (EEE) is enabled.
Overnight testing revealed that when EEE is active, the busy bit can
remain set for up to approximately 300ms. The new 500ms timeout provides
a safety margin.
iov_iter: fix advancing slot in iter_folioq_get_pages()
iter_folioq_get_pages() decides to advance to the next folioq slot when
it has reached the end of the current folio. However, it is checking
offset, which is the beginning of the current part, instead of
iov_offset, which is adjusted to the end of the current part, so it
doesn't advance the slot when it's supposed to. As a result, on the next
iteration, we'll use the same folio with an out-of-bounds offset and
return an unrelated page.
This manifested as various crashes and other failures in 9pfs in drgn's
VM testing setup and BPF CI.
Eric Dumazet [Tue, 24 Sep 2024 15:02:56 +0000 (15:02 +0000)]
net: avoid potential underflow in qdisc_pkt_len_init() with UFO
After commit 7c6d2ecbda83 ("net: be more gentle about silly gso
requests coming from user") virtio_net_hdr_to_skb() had sanity check
to detect malicious attempts from user space to cook a bad GSO packet.
Then commit cf9acc90c80ec ("net: virtio_net_hdr_to_skb: count
transport header in UFO") while fixing one issue, allowed user space
to cook a GSO packet with the following characteristic :
IPv4 SKB_GSO_UDP, gso_size=3, skb->len = 28.
When this packet arrives in qdisc_pkt_len_init(), we end up
with hdr_len = 28 (IPv4 header + UDP header), matching skb->len
There is no need to ask the user about enabling Microchip FDMA
functionality, as all drivers that use it select the FDMA symbol.
Hence make the symbol invisible, unless when compile-testing.
net: fec: Reload PTP registers after link-state change
On link-state change, the controller gets reset,
which clears all PTP registers, including PHC time,
calibrated clock correction values etc. For correct
IEEE 1588 operation we need to restore these after
the reset.
When applying padding, the buffer is not zeroed, which results in memory
disclosure. The mentioned data is observed on the wire. This patch uses
skb_put_padto() to pad Ethernet frames properly. The mentioned function
zeroes the expanded buffer.
In case the packet cannot be padded it is silently dropped. Statistics
are also not incremented. This driver does not support statistics in the
old 32-bit format or the new 64-bit format. These will be added in the
future. In its current form, the patch should be easily backported to
stable versions.
Ethernet MACs on Amazon-SE and Danube cannot do padding of the packets
in hardware, so software padding must be applied.
Daniel Borkmann [Mon, 23 Sep 2024 21:22:42 +0000 (23:22 +0200)]
net: Fix gso_features_check to check for both dev->gso_{ipv4_,}max_size
Commit 24ab059d2ebd ("net: check dev->gso_max_size in gso_features_check()")
added a dev->gso_max_size test to gso_features_check() in order to fall
back to GSO when needed.
This was added as it was noticed that some drivers could misbehave if TSO
packets get too big. However, the check doesn't respect dev->gso_ipv4_max_size
limit. For instance, a device could be configured with BIG TCP for IPv4,
but not IPv6.
Therefore, add a netif_get_gso_max_size() equivalent to netif_get_gro_max_size()
and use the helper to respect both limits before falling back to GSO engine.
Vladimir Oltean [Fri, 13 Sep 2024 20:35:49 +0000 (23:35 +0300)]
net: dsa: improve shutdown sequence
Alexander Sverdlin presents 2 problems during shutdown with the
lan9303 driver. One is specific to lan9303 and the other just happens
to reproduce there.
The first problem is that lan9303 is unique among DSA drivers in that it
calls dev_get_drvdata() at "arbitrary runtime" (not probe, not shutdown,
not remove):
But we never stop the phy_state_machine(), so it may continue to run
after dsa_switch_shutdown(). Our common pattern in all DSA drivers is
to set drvdata to NULL to suppress the remove() method that may come
afterwards. But in this case it will result in an NPD.
The second problem is that the way in which we set
dp->conduit->dsa_ptr = NULL; is concurrent with receive packet
processing. dsa_switch_rcv() checks once whether dev->dsa_ptr is NULL,
but afterwards, rather than continuing to use that non-NULL value,
dev->dsa_ptr is dereferenced again and again without NULL checks:
dsa_conduit_find_user() and many other places. In between dereferences,
there is no locking to ensure that what was valid once continues to be
valid.
Both problems have the common aspect that closing the conduit interface
solves them.
In the first case, dev_close(conduit) triggers the NETDEV_GOING_DOWN
event in dsa_user_netdevice_event() which closes user ports as well.
dsa_port_disable_rt() calls phylink_stop(), which synchronously stops
the phylink state machine, and ds->ops->phy_read() will thus no longer
call into the driver after this point.
In the second case, dev_close(conduit) should do this, as per
Documentation/networking/driver.rst:
| Quiescence
| ----------
|
| After the ndo_stop routine has been called, the hardware must
| not receive or transmit any data. All in flight packets must
| be aborted. If necessary, poll or wait for completion of
| any reset commands.
So it should be sufficient to ensure that later, when we zeroize
conduit->dsa_ptr, there will be no concurrent dsa_switch_rcv() call
on this conduit.
The addition of the netif_device_detach() function is to ensure that
ioctls, rtnetlinks and ethtool requests on the user ports no longer
propagate down to the driver - we're no longer prepared to handle them.
The race condition actually did not exist when commit 0650bf52b31f
("net: dsa: be compatible with masters which unregister on shutdown")
first introduced dsa_switch_shutdown(). It was created later, when we
stopped unregistering the user interfaces from a bad spot, and we just
replaced that sequence with a racy zeroization of conduit->dsa_ptr
(one which doesn't ensure that the interfaces aren't up).
Merge tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- When sched_ext is in bypass mode (e.g. while disabling the BPF
scheduler), it was using one DSQ to implement global FIFO scheduling
as all it has to do is guaranteeing reasonable forward progress.
On multi-socket machines, this can lead to live-lock conditions under
certain workloads. Fixed by splitting the queue used for FIFO
scheduling per NUMA node. This required several preparation patches.
- Hotplug tests on powerpc could reliably trigger deadlock while
enabling a BPF scheduler.
This was caused by cpu_hotplug_lock nesting inside scx_fork_rwsem and
then CPU hotplug path trying to fork a new thread while holding
cpu_hotplug_lock.
Fixed by restructuring locking in enable and disable paths so that
the two locks are not coupled. This required several preparation
patches which also fixed a couple other issues in the enable path.
- A build fix for !CONFIG_SMP
- Userspace tooling sync and updates
* tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove redundant p->nr_cpus_allowed checker
sched_ext: Decouple locks in scx_ops_enable()
sched_ext: Decouple locks in scx_ops_disable_workfn()
sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
sched_ext: Enable scx_ops_init_task() separately
sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
sched_ext: Initialize in bypass mode
sched_ext: Remove SCX_OPS_PREPPING
sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()
sched_ext: Use shorter slice while bypassing
sched_ext: Split the global DSQ per NUMA node
sched_ext: Relocate find_user_dsq()
sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL
tools/sched_ext: Receive misc updates from SCX repo
sched_ext: Add __COMPAT helpers for features added during v6.12 devel cycle
sched_ext: Build fix for !CONFIG_SMP
Merge tag 'vfs-6.12-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"afs:
- Fix setting of the server responding flag
- Remove unused struct afs_address_list and afs_put_address_list()
function
- Fix infinite loop because of unresponsive servers
- Ensure that afs_retry_request() function is correctly added to the
afs_req_ops netfs operations table
netfs:
- Fix netfs_folio tracepoint handling to handle NULL mappings
- Add a missing folio_queue API documentation
- Ensure that netfs_write_folio() correctly advances the iterator via
iov_iter_advance()
- Fix a dentry leak during concurrent cull and cookie lookup
operations in cachefiles
pidfs:
- Correctly handle accessing another task's pid namespace"
* tag 'vfs-6.12-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix the netfs_folio tracepoint to handle NULL mapping
netfs: Add folio_queue API documentation
netfs: Advance iterator correctly rather than jumping it
afs: Fix the setting of the server responding flag
afs: Remove unused struct and function prototype
afs: Fix possible infinite loop with unresponsive servers
pidfs: check for valid pid namespace
afs: Fix missing wire-up of afs_retry_request()
cachefiles: fix dentry leak in cachefiles_open_file()
xol_add_vma() maps the uninitialized page allocated by __create_xol_area()
into userspace. On some architectures (x86) this memory is readable even
without VM_READ, VM_EXEC results in the same pgprot_t as VM_EXEC|VM_READ,
although this doesn't really matter, debugger can read this memory anyway.
leading to a build error when neither KVM_INTEL nor KVM_AMD support is
enabled:
arch/x86/kvm/x86.c: In function ‘kvm_arch_enable_virtualization’:
arch/x86/kvm/x86.c:12517:9: error: implicit declaration of function ‘cpu_emergency_register_virt_callback’ [-Wimplicit-function-declaration]
12517 | cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_disable_virtualization_cpu);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/kvm/x86.c: In function ‘kvm_arch_disable_virtualization’:
arch/x86/kvm/x86.c:12522:9: error: implicit declaration of function ‘cpu_emergency_unregister_virt_callback’ [-Wimplicit-function-declaration]
12522 | cpu_emergency_unregister_virt_callback(kvm_x86_ops.emergency_disable_virtualization_cpu);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix the build by defining empty helper functions the same way the old
cpu_emergency_disable_virtualization() function was dealt with for the
same situation.
Maybe we could instead have made the call sites conditional, since the
callers (kvm_arch_{en,dis}able_virtualization()) have an empty weak
fallback. I'll leave that to the kvm people to argue about, this at
least gets the build going for that particular config.
Merge tag 'mailbox-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox
Pull mailbox updates from Jassi Brar:
- fix kconfig dependencies (mhu-v3, omap2+)
- use devie name instead of genereic imx_mu_chan as interrupt name
(imx)
- enable sa8255p and qcs8300 ipc controllers (qcom)
- Fix timeout during suspend mode (bcm2835)
- convert to use use of_property_match_string (mailbox)
- enable mt8188 (mediatek)
- use devm_clk_get_enabled helpers (spreadtrum)
- fix device-id typo (rockchip)
* tag 'mailbox-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox:
mailbox, remoteproc: omap2+: fix compile testing
dt-bindings: mailbox: qcom-ipcc: Document QCS8300 IPCC
dt-bindings: mailbox: qcom-ipcc: document the support for SA8255p
dt-bindings: mailbox: mtk,adsp-mbox: Add compatible for MT8188
mailbox: Use of_property_match_string() instead of open-coding
mailbox: bcm2835: Fix timeout during suspend mode
mailbox: sprd: Use devm_clk_get_enabled() helpers
mailbox: rockchip: fix a typo in module autoloading
mailbox: imx: use device name in interrupt name
mailbox: ARM_MHU_V3 should depend on ARM64