Ido Schimmel [Mon, 25 Mar 2024 07:50:30 +0000 (09:50 +0200)]
selftests: vxlan_mdb: Fix failures with old libnet
Locally generated IP multicast packets (such as the ones used in the
test) do not perform routing and simply egress the bound device.
However, as explained in commit 8bcfb4ae4d97 ("selftests: forwarding:
Fix failing tests with old libnet"), old versions of libnet (used by
mausezahn) do not use the "SO_BINDTODEVICE" socket option. Specifically,
the library started using the option for IPv6 sockets in version 1.1.6
and for IPv4 sockets in version 1.2. This explains why on Ubuntu - which
uses version 1.1.6 - the IPv4 overlay tests are failing whereas the IPv6
ones are passing.
Fix by specifying the source and destination MAC of the packets which
will cause mausezahn to use a packet socket instead of an IP socket.
Sergey Shtylyov [Sun, 24 Mar 2024 20:40:09 +0000 (23:40 +0300)]
MAINTAINERS: split Renesas Ethernet drivers entry
Since the Renesas Ethernet Switch driver was added by Yoshihiro Shimoda,
I started receiving the patches to review for it -- which I was unable to
do, as I don't know this hardware and don't even have the manuals for it.
Fortunately, Shimoda-san has volunteered to be a reviewer for this new
driver, thus let's now split the single entry into 3 per-driver entries,
each with its own reviewer...
Arınç ÜNAL [Wed, 20 Mar 2024 20:45:30 +0000 (23:45 +0300)]
net: dsa: mt7530: fix improper frames on all 25MHz and 40MHz XTAL MT7530
The MT7530 switch after reset initialises with a core clock frequency that
works with a 25MHz XTAL connected to it. For 40MHz XTAL, the core clock
frequency must be set to 500MHz.
The mt7530_pll_setup() function is responsible of setting the core clock
frequency. Currently, it runs on MT7530 with 25MHz and 40MHz XTAL. This
causes MT7530 switch with 25MHz XTAL to egress and ingress frames
improperly.
Introduce a check to run it only on MT7530 with 40MHz XTAL.
The core clock frequency is set by writing to a switch PHY's register.
Access to the PHY's register is done via the MDIO bus the switch is also
on. Therefore, it works only when the switch makes switch PHYs listen on
the MDIO bus the switch is on. This is controlled either by the state of
the ESW_P1_LED_1 pin after reset deassertion or modifying bit 5 of the
modifiable trap register.
When ESW_P1_LED_1 is pulled high, PHY indirect access is used. That means
accessing PHY registers via the PHY indirect access control register of the
switch.
When ESW_P1_LED_1 is pulled low, PHY direct access is used. That means
accessing PHY registers via the MDIO bus the switch is on.
For MT7530 switch with 40MHz XTAL on a board with ESW_P1_LED_1 pulled high,
the core clock frequency won't be set to 500MHz, causing the switch to
egress and ingress frames improperly.
Run mt7530_pll_setup() after PHY direct access is set on the modifiable
trap register.
With these two changes, all MT7530 switches with 25MHz and 40MHz, and
P1_LED_1 pulled high or low, will egress and ingress frames properly.
The inclusion of io-64-nonatomic-lo-hi.h indicates that all 64bit
accesses can be replaced by pairs of nonatomic 32bit access. Fix
alignment by forcing all accesses to be 32bit on 64bit platforms.
Eric Dumazet [Fri, 22 Mar 2024 13:57:32 +0000 (13:57 +0000)]
tcp: properly terminate timers for kernel sockets
We had various syzbot reports about tcp timers firing after
the corresponding netns has been dismantled.
Fortunately Josef Bacik could trigger the issue more often,
and could test a patch I wrote two years ago.
When TCP sockets are closed, we call inet_csk_clear_xmit_timers()
to 'stop' the timers.
inet_csk_clear_xmit_timers() can be called from any context,
including when socket lock is held.
This is the reason it uses sk_stop_timer(), aka del_timer().
This means that ongoing timers might finish much later.
For user sockets, this is fine because each running timer
holds a reference on the socket, and the user socket holds
a reference on the netns.
For kernel sockets, we risk that the netns is freed before
timer can complete, because kernel sockets do not hold
reference on the netns.
This patch adds inet_csk_clear_xmit_timers_sync() function
that using sk_stop_timer_sync() to make sure all timers
are terminated before the kernel socket is released.
Modules using kernel sockets close them in their netns exit()
handler.
Also add sock_not_owned_by_me() helper to get LOCKDEP
support : inet_csk_clear_xmit_timers_sync() must not be called
while socket lock is held.
It is very possible we can revert in the future commit 3a58f13a881e ("net: rds: acquire refcount on TCP sockets")
which attempted to solve the issue in rds only.
(net/smc/af_smc.c and net/mptcp/subflow.c have similar code)
We probably can remove the check_net() tests from
tcp_out_of_resources() and __tcp_close() in the future.
Ravi Gunasekaran [Fri, 22 Mar 2024 10:04:47 +0000 (15:34 +0530)]
net: hsr: hsr_slave: Fix the promiscuous mode in offload mode
commit e748d0fd66ab ("net: hsr: Disable promiscuous mode in
offload mode") disables promiscuous mode of slave devices
while creating an HSR interface. But while deleting the
HSR interface, it does not take care of it. It decreases the
promiscuous mode count, which eventually enables promiscuous
mode on the slave devices when creating HSR interface again.
Fix this by not decrementing the promiscuous mode count while
deleting the HSR interface when offload is enabled.
Alexandra Winter [Thu, 21 Mar 2024 11:53:37 +0000 (12:53 +0100)]
s390/qeth: handle deferred cc1
The IO subsystem expects a driver to retry a ccw_device_start, when the
subsequent interrupt response block (irb) contains a deferred
condition code 1.
Symptoms before this commit:
On the read channel we always trigger the next read anyhow, so no
different behaviour here.
On the write channel we may experience timeout errors, because the
expected reply will never be received without the retry.
Other callers of qeth_send_control_data() may wrongly assume that the ccw
was successful, which may cause problems later.
Note that since
commit 2297791c92d0 ("s390/cio: dont unregister subchannel from child-drivers")
and
commit 5ef1dc40ffa6 ("s390/cio: fix invalid -EBUSY on ccw_device_start")
deferred CC1s are much more likely to occur. See the commit message of the
latter for more background information.
Pu Lehui [Sun, 24 Mar 2024 10:33:06 +0000 (10:33 +0000)]
riscv, bpf: Fix kfunc parameters incompatibility between bpf and riscv abi
We encountered a failing case when running selftest in no_alu32 mode:
The failure case is `kfunc_call/kfunc_call_test4` and its source code is
like bellow:
```
long bpf_kfunc_call_test4(signed char a, short b, int c, long d) __ksym;
int kfunc_call_test4(struct __sk_buff *skb)
{
...
tmp = bpf_kfunc_call_test4(-3, -30, -200, -1000);
...
}
```
insn 2 is parsed to ld_imm64 insn to emit 0x00000000ffffff38 imm, and
converted to int type and then send to bpf_kfunc_call_test4. But since
it is zero-extended in the bpf calling convention, riscv jit will
directly treat it as an unsigned 32-bit int value, and then fails with
the message "actual 4294966063 != expected -1234".
The reason is the incompatibility between bpf and riscv abi, that is,
bpf will do zero-extension on uint, but riscv64 requires sign-extension
on int or uint. We can solve this problem by sign extending the 32-bit
parameters in kfunc.
The issue is related to [0], and thanks to Yonghong and Alexei.
Kurt Kanzenbach [Wed, 13 Mar 2024 13:03:10 +0000 (14:03 +0100)]
igc: Remove stale comment about Tx timestamping
The initial igc Tx timestamping implementation used only one register for
retrieving Tx timestamps. Commit 3ed247e78911 ("igc: Add support for
multiple in-flight TX timestamps") added support for utilizing all four of
them e.g., for multiple domain support. Remove the stale comment/FIXME.
Przemek Kitszel [Tue, 5 Mar 2024 16:02:02 +0000 (17:02 +0100)]
ixgbe: avoid sleeping allocation in ixgbe_ipsec_vf_add_sa()
Change kzalloc() flags used in ixgbe_ipsec_vf_add_sa() to GFP_ATOMIC, to
avoid sleeping in IRQ context.
Dan Carpenter, with the help of Smatch, has found following issue:
The patch eda0333ac293: "ixgbe: add VF IPsec management" from Aug 13,
2018 (linux-next), leads to the following Smatch static checker
warning: drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c:917 ixgbe_ipsec_vf_add_sa()
warn: sleeping in IRQ context
The call tree that Smatch is worried about is:
ixgbe_msix_other() <- IRQ handler
-> ixgbe_msg_task()
-> ixgbe_rcv_msg_from_vf()
-> ixgbe_ipsec_vf_add_sa()
ice: fix memory corruption bug with suspend and rebuild
The ice driver would previously panic after suspend. This is caused
from the driver *only* calling the ice_vsi_free_q_vectors() function by
itself, when it is suspending. Since commit b3e7b3a6ee92 ("ice: prevent
NULL pointer deref during reload") the driver has zeroed out
num_q_vectors, and only restored it in ice_vsi_cfg_def().
This further causes the ice_rebuild() function to allocate a zero length
buffer, after which num_q_vectors is updated, and then the new value of
num_q_vectors is used to index into the zero length buffer, which
corrupts memory.
The fix entails making sure all the code referencing num_q_vectors only
does so after it has been reset via ice_vsi_cfg_def().
I didn't perform a full bisect, but I was able to test against 6.1.77
kernel and that ice driver works fine for suspend/resume with no panic,
so sometime since then, this problem was introduced.
Also clean up an un-needed init of a local variable in the function
being modified.
Steven Zou [Wed, 7 Feb 2024 01:49:59 +0000 (09:49 +0800)]
ice: Refactor FW data type and fix bitmap casting issue
According to the datasheet, the recipe association data is an 8-byte
little-endian value. It is described as 'Bitmap of the recipe indexes
associated with this profile', it is from 24 to 31 byte area in FW.
Therefore, it is defined to '__le64 recipe_assoc' in struct
ice_aqc_recipe_to_profile. And then fix the bitmap casting issue, as we
must never ever use castings for bitmap type.
Johannes Berg [Mon, 25 Mar 2024 16:43:32 +0000 (17:43 +0100)]
kunit: fix wireless test dependencies
For the wireless tests, CONFIG_WLAN and CONFIG_NETDEVICES are
needed, though seem to be available by default on ARCH=um, so
we didn't notice this before. Add them to fix kunit running
on other architectures.
Benjamin Berg [Wed, 20 Mar 2024 21:26:22 +0000 (23:26 +0200)]
wifi: iwlwifi: mvm: include link ID when releasing frames
When releasing frames from the reorder buffer, the link ID was not
included in the RX status information. This subsequently led mac80211 to
drop the frame. Change it so that the link information is set
immediately when possible so that it doesn't not need to be filled in
anymore when submitting the frame to mac80211.
Johannes Berg [Wed, 20 Mar 2024 21:26:32 +0000 (23:26 +0200)]
wifi: iwlwifi: mvm: handle debugfs names more carefully
With debugfs=off, we can get here with the dbgfs_dir being
an ERR_PTR(). Instead of checking for all this, which is
often flagged as a mistake, simply handle the names here
more carefully by printing them, then we don't need extra
checks.
Also, while checking, I noticed theoretically 'buf' is too
small, so fix that size as well.
Benjamin Berg [Wed, 20 Mar 2024 21:26:23 +0000 (23:26 +0200)]
wifi: iwlwifi: mvm: guard against invalid STA ID on removal
Guard against invalid station IDs in iwl_mvm_mld_rm_sta_id as that would
result in out-of-bounds array accesses. This prevents issues should the
driver get into a bad state during error handling.
Johannes Berg [Tue, 19 Mar 2024 08:10:22 +0000 (10:10 +0200)]
wifi: iwlwifi: read txq->read_ptr under lock
If we read txq->read_ptr without lock, we can read the same
value twice, then obtain the lock, and reclaim from there
to two different places, but crucially reclaim the same
entry twice, resulting in the WARN_ONCE() a little later.
Fix that by reading txq->read_ptr under lock.
Johannes Berg [Tue, 19 Mar 2024 08:10:20 +0000 (10:10 +0200)]
wifi: iwlwifi: fw: don't always use FW dump trig
Since the dump_data (struct iwl_fwrt_dump_data) is a union,
it's not safe to unconditionally access and use the 'trig'
member, it might be 'desc' instead. Access it only if it's
known to be 'trig' rather than 'desc', i.e. if ini-debug
is present.
Ayala Beker [Mon, 18 Mar 2024 16:53:22 +0000 (18:53 +0200)]
wifi: mac80211: correctly set active links upon TTLM
Fix ieee80211_ttlm_set_links() to not set all active links,
but instead let the driver know that valid links status changed
and select the active links properly.
Ilan Peer [Mon, 11 Mar 2024 06:28:05 +0000 (08:28 +0200)]
wifi: iwlwifi: mvm: Configure the link mapping for non-MLD FW
In the non MLD firmware flows, although the deflink is used, the mapping
of link ID to BSS configuration was missing, which causes flows that need
this mapping to crash.
Fix this by adding the link ID to BSS configuration mapping to non MLD
flows as well.
wifi: iwlwifi: mvm: pick the version of SESSION_PROTECTION_NOTIF
When we want to know whether we should look for the mac_id or the
link_id in struct iwl_mvm_session_prot_notif, we should look at the
version of SESSION_PROTECTION_NOTIF.
Johannes Berg [Mon, 18 Mar 2024 16:53:30 +0000 (18:53 +0200)]
wifi: mac80211: fix prep_connection error path
If prep_channel fails in prep_connection, the code releases
the deflink's chanctx, which is wrong since we may be using
a different link. It's already wrong to even do that always
though, since we might still have the station. Remove it
only if prep_channel succeeded and later updates fail.
Johannes Berg [Thu, 14 Mar 2024 10:09:52 +0000 (11:09 +0100)]
wifi: iwlwifi: mvm: disable MLO for the time being
MLO ended up not really fully stable yet, we want to make
sure it works well with the ecosystem before enabling it.
Thus, remove the flag, but set WIPHY_FLAG_DISABLE_WEXT so
we don't get wireless extensions back until we enable MLO
for this hardware.
Johannes Berg [Thu, 14 Mar 2024 10:09:51 +0000 (11:09 +0100)]
wifi: cfg80211: add a flag to disable wireless extensions
Wireless extensions are already disabled if MLO is enabled,
given that we cannot support MLO there with all the hard-
coded assumptions about BSSID etc.
However, the WiFi7 ecosystem is still stabilizing, and some
devices may need MLO disabled while that happens. In that
case, we might end up with a device that supports wext (but
not MLO) in one kernel, and then breaks wext in the future
(by enabling MLO), which is not desirable.
Add a flag to let such drivers/devices disable wext even if
MLO isn't yet enabled.
Running kernel-doc on ieee80211_i.h flagged the following:
net/mac80211/ieee80211_i.h:145: warning: expecting prototype for enum ieee80211_corrupt_data_flags. Prototype was for enum ieee80211_bss_corrupt_data_flags instead
net/mac80211/ieee80211_i.h:162: warning: expecting prototype for enum ieee80211_valid_data_flags. Prototype was for enum ieee80211_bss_valid_data_flags instead
Felix Fietkau [Sat, 16 Mar 2024 07:43:36 +0000 (08:43 +0100)]
wifi: mac80211: check/clear fast rx for non-4addr sta VLAN changes
When moving a station out of a VLAN and deleting the VLAN afterwards, the
fast_rx entry still holds a pointer to the VLAN's netdev, which can cause
use-after-free bugs. Fix this by immediately calling ieee80211_check_fast_rx
after the VLAN change.
Johan Hovold [Mon, 25 Mar 2024 08:59:48 +0000 (09:59 +0100)]
wifi: mac80211: fix mlme_link_id_dbg()
Make sure that the new mlme_link_id_dbg() macro honours
CONFIG_MAC80211_MLME_DEBUG as intended to avoid spamming the log with
messages like:
wlan0: no EHT support, limiting to HE
wlan0: determined local STA to be HE, BW limited to 160 MHz
wlan0: determined AP xx:xx:xx:xx:xx:xx to be VHT
wlan0: connecting with VHT mode, max bandwidth 160 MHz
fs/9p: fix uninitialized values during inode evict
If an iget fails due to not being able to retrieve information
from the server then the inode structure is only partially
initialized. When the inode gets evicted, references to
uninitialized structures (like fscache cookies) were being
made.
This patch checks for a bad_inode before doing anything other
than clearing the inode from the cache. Since the inode is
bad, it shouldn't have any state associated with it that needs
to be written back (and there really isn't a way to complete
those anyways).
tracing: probes: Fix to zero initialize a local variable
Fix to initialize 'val' local variable with zero.
Dan reported that Smatch static code checker reports an error that a local
'val' variable needs to be initialized. Actually, the 'val' is expected to
be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
initialize it with zero.
Zoltan HERPAI [Wed, 20 Mar 2024 08:36:02 +0000 (09:36 +0100)]
pwm: img: fix pwm clock lookup
22e8e19 has introduced a regression in the imgchip->pwm_clk lookup, whereas
the clock name has also been renamed to "imgchip". This causes the driver
failing to load:
[ 0.546905] img-pwm 18101300.pwm: failed to get imgchip clock
[ 0.553418] img-pwm: probe of 18101300.pwm failed with error -2
Fix this lookup by reverting the clock name back to "pwm".
Colin Ian King [Thu, 29 Feb 2024 22:22:50 +0000 (22:22 +0000)]
fs/9p: remove redundant pointer v9ses
Pointer v9ses is being assigned the value from the return of inlined
function v9fs_inode2v9ses (which just returns inode->i_sb->s_fs_info).
The pointer is not used after the assignment, so the variable is
redundant and can be removed.
Cleans up clang scan warnings such as:
fs/9p/vfs_inode_dotl.c:300:28: warning: variable 'v9ses' set but not
used [-Wunused-but-set-variable]
Linus Torvalds [Sun, 24 Mar 2024 20:54:06 +0000 (13:54 -0700)]
Merge tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
- Fix logic that is supposed to prevent placement of the kernel image
below LOAD_PHYSICAL_ADDR
- Use the firmware stack in the EFI stub when running in mixed mode
- Clear BSS only once when using mixed mode
- Check efi.get_variable() function pointer for NULL before trying to
call it
* tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: fix panic in kdump kernel
x86/efistub: Don't clear BSS twice in mixed mode
x86/efistub: Call mixed mode boot services on the firmware's stack
efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
Linus Torvalds [Sun, 24 Mar 2024 18:13:56 +0000 (11:13 -0700)]
Merge tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
- Ensure that the encryption mask at boot is properly propagated on
5-level page tables, otherwise the PGD entry is incorrectly set to
non-encrypted, which causes system crashes during boot.
- Undo the deferred 5-level page table setup as it cannot work with
memory encryption enabled.
- Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset
to the default value but the cached variable is not, so subsequent
comparisons might yield the wrong result and as a consequence the
result prevents updating the MSR.
- Register the local APIC address only once in the MPPARSE enumeration
to prevent triggering the related WARN_ONs() in the APIC and topology
code.
- Handle the case where no APIC is found gracefully by registering a
fake APIC in the topology code. That makes all related topology
functions work correctly and does not affect the actual APIC driver
code at all.
- Don't evaluate logical IDs during early boot as the local APIC IDs
are not yet enumerated and the invoked function returns an error
code. Nothing requires the logical IDs before the final CPUID
enumeration takes place, which happens after the enumeration.
- Cure the fallout of the per CPU rework on UP which misplaced the
copying of boot_cpu_data to per CPU data so that the final update to
boot_cpu_data got lost which caused inconsistent state and boot
crashes.
- Use copy_from_kernel_nofault() in the kprobes setup as there is no
guarantee that the address can be safely accessed.
- Reorder struct members in struct saved_context to work around another
kmemleak false positive
- Remove the buggy code which tries to update the E820 kexec table for
setup_data as that is never passed to the kexec kernel.
- Update the resource control documentation to use the proper units.
- Fix a Kconfig warning observed with tinyconfig
* tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/64: Move 5-level paging global variable assignments back
x86/boot/64: Apply encryption mask to 5-level pagetable update
x86/cpu: Add model number for another Intel Arrow Lake mobile processor
x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Documentation/x86: Document that resctrl bandwidth control units are MiB
x86/mpparse: Register APIC address only once
x86/topology: Handle the !APIC case gracefully
x86/topology: Don't evaluate logical IDs during early boot
x86/cpu: Ensure that CPU info updates are propagated on UP
kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
x86/pm: Work around false positive kmemleak report in msr_build_context()
x86/kexec: Do not update E820 kexec table for setup_data
x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
Linus Torvalds [Sun, 24 Mar 2024 18:11:05 +0000 (11:11 -0700)]
Merge tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler doc clarification from Thomas Gleixner:
"A single update for the documentation of the base_slice_ns tunable to
clarify that any value which is less than the tick slice has no effect
because the scheduler tick is not guaranteed to happen within the set
time slice"
* tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation
Linus Torvalds [Sun, 24 Mar 2024 17:45:31 +0000 (10:45 -0700)]
Merge tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
"This has a set of swiotlb alignment fixes for sometimes very long
standing bugs from Will. We've been discussion them for a while and
they should be solid now"
* tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE
iommu/dma: Force swiotlb_max_mapping_size on an untrusted device
swiotlb: Fix alignment checks when both allocation and DMA masks are present
swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()
swiotlb: Enforce page alignment in swiotlb_alloc()
swiotlb: Fix double-allocation of slots due to broken alignment handling
Check if get_next_variable() is actually valid pointer before
calling it. In kdump kernel this method is set to NULL that causes
panic during the kexec-ed kernel boot.
Tested with QEMU and OVMF firmware.
Fixes: bad267f9e18f ("efi: verify that variable services are supported") Signed-off-by: Oleksandr Tymoshenko <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
Ard Biesheuvel [Fri, 22 Mar 2024 16:01:45 +0000 (17:01 +0100)]
x86/efistub: Don't clear BSS twice in mixed mode
Clearing BSS should only be done once, at the very beginning.
efi_pe_entry() is the entrypoint from the firmware, which may not clear
BSS and so it is done explicitly. However, efi_pe_entry() is also used
as an entrypoint by the mixed mode startup code, in which case BSS will
already have been cleared, and doing it again at this point will corrupt
global variables holding the firmware's GDT/IDT and segment selectors.
So make the memset() conditional on whether the EFI stub is running in
native mode.
Ard Biesheuvel [Fri, 22 Mar 2024 15:03:58 +0000 (17:03 +0200)]
x86/efistub: Call mixed mode boot services on the firmware's stack
Normally, the EFI stub calls into the EFI boot services using the stack
that was live when the stub was entered. According to the UEFI spec,
this stack needs to be at least 128k in size - this might seem large but
all asynchronous processing and event handling in EFI runs from the same
stack and so quite a lot of space may be used in practice.
In mixed mode, the situation is a bit different: the bootloader calls
the 32-bit EFI stub entry point, which calls the decompressor's 32-bit
entry point, where the boot stack is set up, using a fixed allocation
of 16k. This stack is still in use when the EFI stub is started in
64-bit mode, and so all calls back into the EFI firmware will be using
the decompressor's limited boot stack.
Due to the placement of the boot stack right after the boot heap, any
stack overruns have gone unnoticed. However, commit
5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code")
moved the definition of the boot heap into C code, and now the boot
stack is placed right at the base of BSS, where any overruns will
corrupt the end of the .data section.
While it would be possible to work around this by increasing the size of
the boot stack, doing so would affect all x86 systems, and mixed mode
systems are a tiny (and shrinking) fraction of the x86 installed base.
So instead, record the firmware stack pointer value when entering from
the 32-bit firmware, and switch to this stack every time a EFI boot
service call is made.
Tom Lendacky [Fri, 22 Mar 2024 15:41:07 +0000 (10:41 -0500)]
x86/boot/64: Move 5-level paging global variable assignments back
Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging
global variables") moved assignment of 5-level global variables to later
in the boot in order to avoid having to use RIP relative addressing in
order to set them. However, when running with 5-level paging and SME
active (mem_encrypt=on), the variables are needed as part of the page
table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(),
etc.). Since the variables haven't been set, the page table manipulation
is done as if 4-level paging is active, causing the system to crash on
boot.
While only a subset of the assignments that were moved need to be set
early, move all of the assignments back into check_la57_support() so that
these assignments aren't spread between two locations. Instead of just
reverting the fix, this uses the new RIP_REL_REF() macro when assigning
the variables.
Tom Lendacky [Fri, 22 Mar 2024 15:41:06 +0000 (10:41 -0500)]
x86/boot/64: Apply encryption mask to 5-level pagetable update
When running with 5-level page tables, the kernel mapping PGD entry is
updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC,
which, when SME is active (mem_encrypt=on), results in a page table
entry without the encryption mask set, causing the system to crash on
boot.
Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so
that the encryption mask is set for the PGD entry.
Adamos Ttofari [Fri, 22 Mar 2024 23:04:39 +0000 (16:04 -0700)]
x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD
Commit 672365477ae8 ("x86/fpu: Update XFD state where required") and
commit 8bf26758ca96 ("x86/fpu: Add XFD state to fpstate") introduced a
per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in
order to avoid unnecessary writes to the MSR.
On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which
wipes out any stale state. But the per CPU cached xfd value is not
reset, which brings them out of sync.
As a consequence a subsequent xfd_update_state() might fail to update
the MSR which in turn can result in XRSTOR raising a #NM in kernel
space, which crashes the kernel.
To fix this, introduce xfd_set_state() to write xfd_state together
with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD.
Tony Luck [Fri, 22 Mar 2024 18:20:15 +0000 (11:20 -0700)]
Documentation/x86: Document that resctrl bandwidth control units are MiB
The memory bandwidth software controller uses 2^20 units rather than
10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M"
Linux define for 0x00100000.
Update the documentation to use MiB when describing this feature.
It's too late to fix the mount option "mba_MBps" as that is now an
established user interface.
Linus Torvalds [Sat, 23 Mar 2024 21:49:25 +0000 (14:49 -0700)]
Merge tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"Two regression fixes for the timer and timer migration code:
- Prevent endless timer requeuing which is caused by two CPUs racing
out of idle. This happens when the last CPU goes idle and therefore
has to ensure to expire the pending global timers and some other
CPU come out of idle at the same time and the other CPU wins the
race and expires the global queue. This causes the last CPU to
chase ghost timers forever and reprogramming it's clockevent device
endlessly.
Cure this by re-evaluating the wakeup time unconditionally.
- The split into local (pinned) and global timers in the timer wheel
caused a regression for NOHZ full as it broke the idle tracking of
global timers. On NOHZ full this prevents an self IPI being sent
which in turn causes the timer to be not programmed and not being
expired on time.
Restore the idle tracking for the global timer base so that the
self IPI condition for NOHZ full is working correctly again"
* tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix removed self-IPI on global timer's enqueue in nohz_full
timers/migration: Fix endless timer requeue after idle interrupts
Linus Torvalds [Sat, 23 Mar 2024 21:42:45 +0000 (14:42 -0700)]
Merge tag 'timers-core-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more clocksource updates from Thomas Gleixner:
"A set of updates for clocksource and clockevent drivers:
- A fix for the prescaler of the ARM global timer where the prescaler
mask define only covered 4 bits while it is actully 8 bits wide.
This obviously restricted the possible range of prescaler
adjustments
- A fix for the RISC-V timer which prevents a timer interrupt being
raised while the timer is initialized
- A set of device tree updates to support new system on chips in
various drivers
- Kernel-doc and other cleanups all over the place"
* tag 'timers-core-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-riscv: Clear timer interrupt on timer initialization
dt-bindings: timer: Add support for cadence TTC PWM
clocksource/drivers/arm_global_timer: Simplify prescaler register access
clocksource/drivers/arm_global_timer: Guard against division by zero
clocksource/drivers/arm_global_timer: Make gt_target_rate unsigned long
dt-bindings: timer: add Ralink SoCs system tick counter
clocksource: arm_global_timer: fix non-kernel-doc comment
clocksource/drivers/arm_global_timer: Remove stray tab
clocksource/drivers/arm_global_timer: Fix maximum prescaler value
clocksource/drivers/imx-sysctr: Add i.MX95 support
clocksource/drivers/imx-sysctr: Drop use global variables
dt-bindings: timer: nxp,sysctr-timer: support i.MX95
dt-bindings: timer: renesas: ostm: Document RZ/Five SoC
dt-bindings: timer: renesas,tmu: Document input capture interrupt
clocksource/drivers/ti-32K: Fix misuse of "/**" comment
clocksource/drivers/stm32: Fix all kernel-doc warnings
dt-bindings: timer: exynos4210-mct: Add google,gs101-mct compatible
clocksource/drivers/imx: Fix -Wunused-but-set-variable warning
Linus Torvalds [Sat, 23 Mar 2024 21:30:38 +0000 (14:30 -0700)]
Merge tag 'irq-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A series of fixes for the Renesas RZG21 interrupt chip driver to
prevent spurious and misrouted interrupts.
- Ensure that posted writes are flushed in the eoi() callback
- Ensure that interrupts are masked at the chip level when the
trigger type is changed
- Clear the interrupt status register when setting up edge type
trigger modes
- Ensure that the trigger type and routing information is set before
the interrupt is enabled"
* tag 'irq-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/renesas-rzg2l: Do not set TIEN and TINT source at the same time
irqchip/renesas-rzg2l: Prevent spurious interrupts when setting trigger type
irqchip/renesas-rzg2l: Rename rzg2l_irq_eoi()
irqchip/renesas-rzg2l: Rename rzg2l_tint_eoi()
irqchip/renesas-rzg2l: Flush posted write in irq_eoi()
Linus Torvalds [Sat, 23 Mar 2024 21:17:37 +0000 (14:17 -0700)]
Merge tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core entry fix from Thomas Gleixner:
"A single fix for the generic entry code:
The trace_sys_enter() tracepoint can modify the syscall number via
kprobes or BPF in pt_regs, but that requires that the syscall number
is re-evaluted from pt_regs after the tracepoint.
A seccomp fix in that area removed the re-evaluation so the change
does not take effect as the code just uses the locally cached number.
Restore the original behaviour by re-evaluating the syscall number
after the tracepoint"
* tag 'core-entry-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Respect changes to system call number by trace_sys_enter()
Linus Torvalds [Sat, 23 Mar 2024 16:21:26 +0000 (09:21 -0700)]
Merge tag 'powerpc-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull more powerpc updates from Michael Ellerman:
- Handle errors in mark_rodata_ro() and mark_initmem_nx()
- Make struct crash_mem available without CONFIG_CRASH_DUMP
Thanks to Christophe Leroy and Hari Bathini.
* tag 'powerpc-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/kdump: Split KEXEC_CORE and CRASH_DUMP dependency
powerpc/kexec: split CONFIG_KEXEC_FILE and CONFIG_CRASH_DUMP
kexec/kdump: make struct crash_mem available without CONFIG_CRASH_DUMP
powerpc: Handle error in mark_rodata_ro() and mark_initmem_nx()
Linus Torvalds [Sat, 23 Mar 2024 16:17:03 +0000 (09:17 -0700)]
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:
- remove a misuse of kernel-doc comment
- use "Call trace:" for backtraces like other architectures
- implement copy_from_kernel_nofault_allowed() to fix a LKDTM test
- add a "cut here" line for prefetch aborts
- remove unnecessary Kconfing entry for FRAME_POINTER
- remove iwmmxy support for PJ4/PJ4B cores
- use bitfield helpers in ptrace to improve readabililty
- check if folio is reserved before flushing
* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: 9359/1: flush: check if the folio is reserved for no-mapping addresses
ARM: 9354/1: ptrace: Use bitfield helpers
ARM: 9352/1: iwmmxt: Remove support for PJ4/PJ4B cores
ARM: 9353/1: remove unneeded entry for CONFIG_FRAME_POINTER
ARM: 9351/1: fault: Add "cut here" line for prefetch aborts
ARM: 9350/1: fault: Implement copy_from_kernel_nofault_allowed()
ARM: 9349/1: unwind: Add missing "Call trace:" line
ARM: 9334/1: mm: init: remove misuse of kernel-doc comment
Linus Torvalds [Sat, 23 Mar 2024 15:43:21 +0000 (08:43 -0700)]
Merge tag 'hardening-v6.9-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull more hardening updates from Kees Cook:
- CONFIG_MEMCPY_SLOW_KUNIT_TEST is no longer needed (Guenter Roeck)
- Fix needless UTF-8 character in arch/Kconfig (Liu Song)
- Improve __counted_by warning message in LKDTM (Nathan Chancellor)
- Refactor DEFINE_FLEX() for default use of __counted_by
- Disable signed integer overflow sanitizer on GCC < 8
* tag 'hardening-v6.9-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
lkdtm/bugs: Improve warning message for compilers without counted_by support
overflow: Change DEFINE_FLEX to take __counted_by member
Revert "kunit: memcpy: Split slow memcpy tests into MEMCPY_SLOW_KUNIT_TEST"
arch/Kconfig: eliminate needless UTF-8 character in Kconfig help
ubsan: Disable signed integer overflow sanitizer on GCC < 8
Thomas Gleixner [Fri, 22 Mar 2024 18:56:39 +0000 (19:56 +0100)]
x86/mpparse: Register APIC address only once
The APIC address is registered twice. First during the early detection and
afterwards when actually scanning the table for APIC IDs. The APIC and
topology core warn about the second attempt.
Thomas Gleixner [Fri, 22 Mar 2024 18:56:38 +0000 (19:56 +0100)]
x86/topology: Handle the !APIC case gracefully
If there is no local APIC enumerated and registered then the topology
bitmaps are empty. Therefore, topology_init_possible_cpus() will die with
a division by zero exception.
Prevent this by registering a fake APIC id to populate the topology
bitmap. This also allows to use all topology query interfaces
unconditionally. It does not affect the actual APIC code because either
the local APIC address was not registered or no local APIC could be
detected.
Thomas Gleixner [Fri, 22 Mar 2024 18:56:35 +0000 (19:56 +0100)]
x86/cpu: Ensure that CPU info updates are propagated on UP
The boot sequence evaluates CPUID information twice:
1) During early boot
2) When finalizing the early setup right before
mitigations are selected and alternatives are patched.
In both cases the evaluation is stored in boot_cpu_data, but on UP the
copying of boot_cpu_data to the per CPU info of the boot CPU happens
between #1 and #2. So any update which happens in #2 is never propagated to
the per CPU info instance.
Consolidate the whole logic and copy boot_cpu_data right before applying
alternatives as that's the point where boot_cpu_data is in it's final
state and not supposed to change anymore.
This also removes the voodoo mb() from smp_prepare_cpus_common() which
had absolutely no purpose.
Puranjay Mohan [Fri, 22 Mar 2024 15:35:18 +0000 (15:35 +0000)]
bpf: verifier: reject addr_space_cast insn without arena
The verifier allows using the addr_space_cast instruction in a program
that doesn't have an associated arena. This was caught in the form an
invalid memory access in do_misc_fixups() when while converting
addr_space_cast to a normal 32-bit mov, env->prog->aux->arena was
dereferenced to check for BPF_F_NO_USER_CONV flag.
Reject programs that include the addr_space_cast instruction but don't
have an associated arena.
Puranjay Mohan [Thu, 21 Mar 2024 15:39:39 +0000 (15:39 +0000)]
bpf: verifier: fix addr_space_cast from as(1) to as(0)
The verifier currently converts addr_space_cast from as(1) to as(0) that
is: BPF_ALU64 | BPF_MOV | BPF_X with off=1 and imm=1
to
BPF_ALU | BPF_MOV | BPF_X with imm=1 (32-bit mov)
Because of this imm=1, the JITs that have bpf_jit_needs_zext() == true,
interpret the converted instruction as BPF_ZEXT_REG(DST) which is a
special form of mov32, used for doing explicit zero extension on dst.
These JITs will just zero extend the dst reg and will not move the src to
dst before the zext.
Fix do_misc_fixups() to set imm=0 when converting addr_space_cast to a
normal mov32.
The JITs that have bpf_jit_needs_zext() == true rely on the verifier to
emit zext instructions. Mark dst_reg as subreg when doing cast from
as(1) to as(0) so the verifier emits a zext instruction after the mov.
Ido Schimmel [Thu, 21 Mar 2024 17:30:42 +0000 (19:30 +0200)]
ipv6: Fix address dump when IPv6 is disabled on an interface
Cited commit started returning an error when user space requests to dump
the interface's IPv6 addresses and IPv6 is disabled on the interface.
Restore the previous behavior and do not return an error.
Before cited commit:
# ip address show dev dummy1
2: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 1a:52:02:5a:c2:6e brd ff:ff:ff:ff:ff:ff
inet6 fe80::1852:2ff:fe5a:c26e/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 1000
# ip address show dev dummy1
2: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1000 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 1a:52:02:5a:c2:6e brd ff:ff:ff:ff:ff:ff
After cited commit:
# ip address show dev dummy1
2: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 1e:9b:94:00:ac:e8 brd ff:ff:ff:ff:ff:ff
inet6 fe80::1c9b:94ff:fe00:ace8/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 1000
# ip address show dev dummy1
RTNETLINK answers: No such device
Dump terminated
With this patch:
# ip address show dev dummy1
2: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 42:35:fc:53:66:cf brd ff:ff:ff:ff:ff:ff
inet6 fe80::4035:fcff:fe53:66cf/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
# ip link set dev dummy1 mtu 1000
# ip address show dev dummy1
2: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1000 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 42:35:fc:53:66:cf brd ff:ff:ff:ff:ff:ff
Jakub Kicinski [Thu, 21 Mar 2024 02:02:14 +0000 (19:02 -0700)]
tools: ynl: fix setting presence bits in simple nests
When we set members of simple nested structures in requests
we need to set "presence" bits for all the nesting layers
below. This has nothing to do with the presence type of
the last layer.
lkdtm/bugs: Improve warning message for compilers without counted_by support
The current message for telling the user that their compiler does not
support the counted_by attribute in the FAM_BOUNDS test does not make
much sense either grammatically or semantically. Fix it to make it
correct in both aspects.
Kees Cook [Wed, 6 Mar 2024 23:51:36 +0000 (15:51 -0800)]
overflow: Change DEFINE_FLEX to take __counted_by member
The norm should be flexible array structures with __counted_by
annotations, so DEFINE_FLEX() is updated to expect that. Rename
the non-annotated version to DEFINE_RAW_FLEX(), and update the
few existing users. Additionally add selftests for the macros.
Linus Torvalds [Fri, 22 Mar 2024 20:31:07 +0000 (13:31 -0700)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"The vfs has long had a write lifetime hint mechanism that gives the
expected longevity on storage of the data being written. f2fs was the
original consumer of this and used the hint for flash data placement
(mostly to avoid write amplification by placing objects with similar
lifetimes in the same erase block).
More recently the SCSI based UFS (Universal Flash Storage) drivers
have wanted to take advantage of this as well, for the same reasons as
f2fs, necessitating plumbing the write hints through the block layer
and then adding it to the SCSI core.
The vfs write_hints already taken plumbs this as far as block and this
completes the SCSI core enabling based on a recently agreed reuse of
the old write command group number. The additions to the scsi_debug
driver are for emulating this property so we can run tests on it in
the absence of an actual UFS device"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: scsi_debug: Maintain write statistics per group number
scsi: scsi_debug: Implement GET STREAM STATUS
scsi: scsi_debug: Implement the IO Advice Hints Grouping mode page
scsi: scsi_debug: Allocate the MODE SENSE response from the heap
scsi: scsi_debug: Rework subpage code error handling
scsi: scsi_debug: Rework page code error handling
scsi: scsi_debug: Support the block limits extension VPD page
scsi: scsi_debug: Reduce code duplication
scsi: sd: Translate data lifetime information
scsi: scsi_proto: Add structures and constants related to I/O groups and streams
scsi: core: Query the Block Limits Extension VPD page
Linus Torvalds [Fri, 22 Mar 2024 19:46:07 +0000 (12:46 -0700)]
Merge tag 'block-6.9-20240322' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- Make an informative message less ominous (Keith)
- Enhanced trace decoding (Guixin)
- TCP updates (Hannes, Li)
- Fabrics connect deadlock fix (Chunguang)
- Platform API migration update (Uwe)
- A new device quirk (Jiawei)
- Remove dead assignment in fd (Yufeng)
* tag 'block-6.9-20240322' of git://git.kernel.dk/linux:
nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag
nvme: remove redundant BUILD_BUG_ON check
floppy: remove duplicated code in redo_fd_request()
nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq
nvme-tcp: Export the nvme_tcp_wq to sysfs
drivers/nvme: Add quirks for device 126f:2262
nvme: parse format command's lbafu when tracing
nvme: add tracing of reservation commands
nvme: parse zns command's zsa and zrasf to string
nvme: use nvme_disk_is_ns_head helper
nvme: fix reconnection fail due to reserved tag allocation
nvmet: add tracing of zns commands
nvmet: add tracing of authentication commands
nvme-apple: Convert to platform remove callback returning void
nvmet-tcp: do not continue for invalid icreq
nvme: change shutdown timeout setting message
Linus Torvalds [Fri, 22 Mar 2024 19:42:55 +0000 (12:42 -0700)]
Merge tag 'io_uring-6.9-20240322' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"One patch just missed the initial pull, the rest are either fixes or
small cleanups that make our life easier for the next kernel:
- Fix a potential leak in error handling of pinned pages, and clean
it up (Gabriel, Pavel)
- Fix an issue with how read multishot returns retry (me)
- Fix a problem with waitid/futex removals, if we hit the case of
needing to remove all of them at exit time (me)
- Fix for a regression introduced in this merge window, where we
don't always have sr->done_io initialized if the ->prep_async()
path is used (me)
- Fix for SQPOLL setup error handling (me)
- Fix for a poll removal request being delayed (Pavel)
- Rename of a struct member which had a confusing name (Pavel)"
* tag 'io_uring-6.9-20240322' of git://git.kernel.dk/linux:
io_uring/sqpoll: early exit thread if task_context wasn't allocated
io_uring: clear opcode specific data for an early failure
io_uring/net: ensure async prep handlers always initialize ->done_io
io_uring/waitid: always remove waitid entry for cancel all
io_uring/futex: always remove futex entry for cancel all
io_uring: fix poll_remove stalled req completion
io_uring: Fix release of pinned pages when __io_uaddr_map fails
io_uring/kbuf: rename is_mapped
io_uring: simplify io_pages_free
io_uring: clean rings on NO_MMAP alloc fail
io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retry
io_uring: don't save/restore iowait state
Linus Torvalds [Fri, 22 Mar 2024 19:34:26 +0000 (12:34 -0700)]
Merge tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix a memory leak in DM integrity recheck code that was added during
the 6.9 merge. Also fix the recheck code to ensure it issues bios
with proper alignment.
- Fix DM snapshot's dm_exception_table_exit() to schedule while
handling an large exception table during snapshot device shutdown.
* tag 'for-6.9/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm-integrity: align the outgoing bio in integrity_recheck
dm snapshot: fix lockup in dm_exception_table_exit
dm-integrity: fix a memory leak when rechecking the data
Linus Torvalds [Fri, 22 Mar 2024 18:15:45 +0000 (11:15 -0700)]
Merge tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"A patch to minimize blockage when processing very large batches of
dirty caps and two fixes to better handle EOF in the face of multiple
clients performing reads and size-extending writes at the same time"
* tag 'ceph-for-6.9-rc1' of https://github.com/ceph/ceph-client:
ceph: set correct cap mask for getattr request for read
ceph: stop copying to iter at EOF on sync reads
ceph: remove SLAB_MEM_SPREAD flag usage
ceph: break the check delayed cap loop every 5s
Linus Torvalds [Fri, 22 Mar 2024 18:12:21 +0000 (11:12 -0700)]
Merge tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Fix invalid pointer dereference by initializing xmbuf before
tracepoint function is invoked
- Use memalloc_nofs_save() when inserting into quota radix tree
* tag 'xfs-6.9-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: quota radix tree allocations need to be NOFS on insert
xfs: fix dev_t usage in xmbuf tracepoints
Linus Torvalds [Fri, 22 Mar 2024 17:22:45 +0000 (10:22 -0700)]
Merge tag 'loongarch-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch updates from Huacai Chen:
- Add objtool support for LoongArch
- Add ORC stack unwinder support for LoongArch
- Add kernel livepatching support for LoongArch
- Select ARCH_HAS_CURRENT_STACK_POINTER in Kconfig
- Select HAVE_ARCH_USERFAULTFD_MINOR in Kconfig
- Some bug fixes and other small changes
* tag 'loongarch-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch/crypto: Clean up useless assignment operations
LoongArch: Define the __io_aw() hook as mmiowb()
LoongArch: Remove superfluous flush_dcache_page() definition
LoongArch: Move {dmw,tlb}_virt_to_page() definition to page.h
LoongArch: Change __my_cpu_offset definition to avoid mis-optimization
LoongArch: Select HAVE_ARCH_USERFAULTFD_MINOR in Kconfig
LoongArch: Select ARCH_HAS_CURRENT_STACK_POINTER in Kconfig
LoongArch: Add kernel livepatching support
LoongArch: Add ORC stack unwinder support
objtool: Check local label in read_unwind_hints()
objtool: Check local label in add_dead_ends()
objtool/LoongArch: Enable orc to be built
objtool/x86: Separate arch-specific and generic parts
objtool/LoongArch: Implement instruction decoder
objtool/LoongArch: Enable objtool to be built
Linus Torvalds [Fri, 22 Mar 2024 17:09:08 +0000 (10:09 -0700)]
Merge tag 'fbdev-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev
Pull fbdev updates from Helge Deller:
- Allow console fonts up to 64x128 pixels (Samuel Thibault)
- Prevent division-by-zero in fb monitor code (Roman Smirnov)
- Drop Renesas ARM platforms from Mobile LCDC framebuffer driver (Geert
Uytterhoeven)
- Various code cleanups in viafb, uveafb and mb862xxfb drivers by
Aleksandr Burakov, Li Zhijian and Michael Ellerman
* tag 'fbdev-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
fbdev: panel-tpo-td043mtea1: Convert sprintf() to sysfs_emit()
fbmon: prevent division by zero in fb_videomode_from_videomode()
fbcon: Increase maximum font width x height to 64 x 128
fbdev: viafb: fix typo in hw_bitblt_1 and hw_bitblt_2
fbdev: mb862xxfb: Fix defined but not used error
fbdev: uvesafb: Convert sprintf/snprintf to sysfs_emit
fbdev: Restrict FB_SH_MOBILE_LCDC to SuperH
Linus Torvalds [Fri, 22 Mar 2024 16:57:00 +0000 (09:57 -0700)]
Merge tag 'spi-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes that came in since the merge window. Most
of it is relatively minor driver specific fixes, there's also fixes
for error handling with SPI flash devices and a fix restoring delay
control functionality for non-GPIO chip selects managed by the core"
* tag 'spi-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: spi-mt65xx: Fix NULL pointer access in interrupt handler
spi: docs: spidev: fix echo command format
spi: spi-imx: fix off-by-one in mx51 CPU mode burst length
spi: lm70llp: fix links in doc and comments
spi: Fix error code checking in spi_mem_exec_op()
spi: Restore delays for non-GPIO chip select
spi: lpspi: Avoid potential use-after-free in probe()
Linus Torvalds [Fri, 22 Mar 2024 16:52:37 +0000 (09:52 -0700)]
Merge tag 'regulator-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"One fix that came in during the merge window, fixing a problem with
bootstrapping the state of exclusive regulators which have a parent
regulator"
* tag 'regulator-fix-v6.9-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: core: Propagate the regulator state in case of exclusive get
Linus Torvalds [Fri, 22 Mar 2024 16:44:19 +0000 (09:44 -0700)]
Merge tag 'sound-fix2-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull more sound fixes from Takashi Iwai:
"The remaining fixes for 6.9-rc1 that have been gathered in this week.
More about ASoC at this time (one long-standing fix for compress
offload, SOF, AMD ACP, Rockchip, Cirrus and tlv320 stuff) while
another regression fix in ALSA core and a couple of HD-audio quirks as
usual are included"
* tag 'sound-fix2-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: control: Fix unannotated kfree() cleanup
ALSA: hda/realtek: Add quirks for some Clevo laptops
ALSA: hda/realtek: Add quirk for HP Spectre x360 14 eu0000
ALSA: hda/realtek: fix the hp playback volume issue for LG machines
ASoC: soc-compress: Fix and add DPCM locking
ASoC: SOF: amd: Skip IRAM/DRAM size modification for Steam Deck OLED
ASoC: SOF: amd: Move signed_fw_image to struct acp_quirk_entry
ASoC: amd: yc: Revert "add new YC platform variant (0x63) support"
ASoC: amd: yc: Revert "Fix non-functional mic on Lenovo 21J2"
ASoC: soc-core.c: Skip dummy codec when adding platforms
ASoC: rockchip: i2s-tdm: Fix inaccurate sampling rates
ASoC: dt-bindings: cirrus,cs42l43: Fix 'gpio-ranges' schema
ASoC: amd: yc: Fix non-functional mic on ASUS M7600RE
ASoC: tlv320adc3xxx: Don't strip remove function when driver is builtin
efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
Following warning is sometimes observed while booting my servers:
[ 3.594838] DMA: preallocated 4096 KiB GFP_KERNEL pool for atomic allocations
[ 3.602918] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-1
...
[ 3.851862] DMA: preallocated 1024 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation
If 'nokaslr' boot option is set, the warning always happens.
On x86, ZONE_DMA is small zone at the first 16MB of physical address
space. When this problem happens, most of that space seems to be used by
decompressed kernel. Thereby, there is not enough space at DMA_ZONE to
meet the request of DMA pool allocation.
The commit 2f77465b05b1 ("x86/efistub: Avoid placing the kernel below
LOAD_PHYSICAL_ADDR") tried to fix this problem by introducing lower
bound of allocation.
But the fix is not complete.
efi_random_alloc() allocates pages by following steps.
1. Count total available slots ('total_slots')
2. Select a slot ('target_slot') to allocate randomly
3. Calculate a starting address ('target') to be included target_slot
4. Allocate pages, which starting address is 'target'
In step 1, 'alloc_min' is used to offset the starting address of memory
chunk. But in step 3 'alloc_min' is not considered at all. As the
result, 'target' can be miscalculated and become lower than 'alloc_min'.
When KASLR is disabled, 'target_slot' is always 0 and the problem
happens everytime if the EFI memory map of the system meets the
condition.
Fix this problem by calculating 'target' considering 'alloc_min'.
Eric Biggers [Wed, 13 Mar 2024 23:32:27 +0000 (16:32 -0700)]
Revert "crypto: pkcs7 - remove sha1 support"
This reverts commit 16ab7cb5825fc3425c16ad2c6e53d827f382d7c6 because it
broke iwd. iwd uses the KEYCTL_PKEY_* UAPIs via its dependency libell,
and apparently it is relying on SHA-1 signature support. These UAPIs
are fairly obscure, and their documentation does not mention which
algorithms they support. iwd really should be using a properly
supported userspace crypto library instead. Regardless, since something
broke we have to revert the change.
It may be possible that some parts of this commit can be reinstated
without breaking iwd (e.g. probably the removal of MODULE_SIG_SHA1), but
for now this just does a full revert to get things working again.
kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address
Read from an unsafe address with copy_from_kernel_nofault() in
arch_adjust_kprobe_addr() because this function is used before checking
the address is in text or not. Syzcaller bot found a bug and reported
the case if user specifies inaccessible data area,
arch_adjust_kprobe_addr() will cause a kernel panic.
x86/pm: Work around false positive kmemleak report in msr_build_context()
Since:
7ee18d677989 ("x86/power: Make restore_processor_context() sane")
kmemleak reports this issue:
unreferenced object 0xf68241e0 (size 32):
comm "swapper/0", pid 1, jiffies 4294668610 (age 68.432s)
hex dump (first 32 bytes):
00 cc cc cc 29 10 01 c0 00 00 00 00 00 00 00 00 ....)...........
00 42 82 f6 cc cc cc cc cc cc cc cc cc cc cc cc .B..............
backtrace:
[<461c1d50>] __kmem_cache_alloc_node+0x106/0x260
[<ea65e13b>] __kmalloc+0x54/0x160
[<c3858cd2>] msr_build_context.constprop.0+0x35/0x100
[<46635aff>] pm_check_save_msr+0x63/0x80
[<6b6bb938>] do_one_initcall+0x41/0x1f0
[<3f3add60>] kernel_init_freeable+0x199/0x1e8
[<3b538fde>] kernel_init+0x1a/0x110
[<938ae2b2>] ret_from_fork+0x1c/0x28
Which is a false positive.
Reproducer:
- Run rsync of whole kernel tree (multiple times if needed).
- start a kmemleak scan
- Note this is just an example: a lot of our internal tests hit these.
The root cause is similar to the fix in:
b0b592cf0836 x86/pm: Fix false positive kmemleak report in msr_build_context()
ie. the alignment within the packed struct saved_context
which has everything unaligned as there is only "u16 gs;" at start of
struct where in the past there were four u16 there thus aligning
everything afterwards. The issue is with the fact that Kmemleak only
searches for pointers that are aligned (see how pointers are scanned in
kmemleak.c) so when the struct members are not aligned it doesn't see
them.
Testing:
We run a lot of tests with our CI, and after applying this fix we do not
see any kmemleak issues any more whilst without it we see hundreds of
the above report. From a single, simple test run consisting of 416 individual test
cases on kernel 5.10 x86 with kmemleak enabled we got 20 failures due to this,
which is quite a lot. With this fix applied we get zero kmemleak related failures.
Ryosuke Yasuoka [Wed, 20 Mar 2024 00:54:10 +0000 (09:54 +0900)]
nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet
syzbot reported the following uninit-value access issue [1][2]:
nci_rx_work() parses and processes received packet. When the payload
length is zero, each message type handler reads uninitialized payload
and KMSAN detects this issue. The receipt of a packet with a zero-size
payload is considered unexpected, and therefore, such packets should be
silently discarded.
This patch resolved this issue by checking payload size before calling
each message type handler codes.
Fixes: 6a2968aaf50c ("NFC: basic NCI protocol implementation") Reported-and-tested-by: [email protected] Reported-and-tested-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=7ea9413ea6749baf5574 [1] Closes: https://syzkaller.appspot.com/bug?extid=29b5ca705d2e0f4a44d2 [2] Signed-off-by: Ryosuke Yasuoka <[email protected]> Reviewed-by: Jeremy Cline <[email protected]> Reviewed-by: Krzysztof Kozlowski <[email protected]> Signed-off-by: David S. Miller <[email protected]>
Dave Young [Fri, 22 Mar 2024 05:15:08 +0000 (13:15 +0800)]
x86/kexec: Do not update E820 kexec table for setup_data
crashkernel reservation failed on a Thinkpad t440s laptop recently.
Actually the memblock reservation succeeded, but later insert_resource()
failed.
Test steps:
kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */
kexec reboot ->
dmesg|grep "crashkernel reserved";
crashkernel memory range like below reserved successfully:
0x00000000d0000000 - 0x00000000da000000
But no such "Crash kernel" region in /proc/iomem
The background story:
Currently the E820 code reserves setup_data regions for both the current
kernel and the kexec kernel, and it inserts them into the resources list.
Before the kexec kernel reboots nobody passes the old setup_data, and
kexec only passes fresh SETUP_EFI/SETUP_IMA/SETUP_RNG_SEED if needed.
Thus the old setup data memory is not used at all.
Due to old kernel updates the kexec e820 table as well so kexec kernel
sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data
regions are inserted into resources list in the kexec kernel by
e820__reserve_resources().
Note, due to no setup_data is passed in for those old regions they are not
early reserved (by function early_reserve_memory), and the crashkernel
memblock reservation will just treat them as usable memory and it could
reserve the crashkernel region which overlaps with the old setup_data
regions. And just like the bug I noticed here, kdump insert_resource
failed because e820__reserve_resources has added the overlapped chunks
in /proc/iomem already.
Finally, looking at the code, the old setup_data regions are not used
at all as no setup_data is passed in by the kexec boot loader. Although
something like SETUP_PCI etc could be needed, kexec should pass
the info as new setup_data so that kexec kernel can take care of them.
This should be taken care of in other separate patches if needed.
Justin Stitt [Thu, 21 Mar 2024 20:04:08 +0000 (20:04 +0000)]
binfmt: replace deprecated strncpy
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.
There is a _nearly_ identical implementation of fill_psinfo present in
binfmt_elf.c -- except that one uses get_task_comm over strncpy(). Let's
mirror that in binfmt_elf_fdpic.c
Linus Torvalds [Fri, 22 Mar 2024 02:14:28 +0000 (19:14 -0700)]
Merge tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Various get_inode_info_fixes
- Fix for querying xattrs of cached dirs
- Four minor cleanup fixes (including adding some header corrections
and a missing flag)
- Performance improvement for deferred close
- Two query interface fixes
* tag '6.9-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
smb311: additional compression flag defined in updated protocol spec
smb311: correct incorrect offset field in compression header
cifs: Move some extern decls from .c files to .h
cifs: remove redundant variable assignment
cifs: fixes for get_inode_info
cifs: open_cached_dir(): add FILE_READ_EA to desired access
cifs: reduce warning log level for server not advertising interfaces
cifs: make sure server interfaces are requested only for SMB3+
cifs: defer close file handles having RH lease
Linus Torvalds [Fri, 22 Mar 2024 02:04:31 +0000 (19:04 -0700)]
Merge tag 'drm-next-2024-03-22' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Fixes from the last week (or 3 weeks in amdgpu case), after amdgpu,
it's xe and nouveau then a few scattered core fixes.
core:
- fix rounding in drm_fixp2int_round()
bridge:
- fix documentation for DRM_BRIDGE_OP_EDID
sun4i:
- fix 64-bit division on 32-bit architectures
tests:
- fix dependency on DRM_KMS_HELPER
probe-helper:
- never return negative values from .get_modes() plus driver fixes
xe:
- invalidate userptr vma on page pin fault
- fail early on sysfs file creation error
- skip VMA pinning on xe_exec if no batches
nouveau:
- clear bo resource bus after eviction
- documentation fixes
- don't check devinit disable on GSP
amdgpu:
- Freesync fixes
- UAF IOCTL fixes
- Fix mmhub client ID mapping
- IH 7.0 fix
- DML2 fixes
- VCN 4.0.6 fix
- GART bind fix
- GPU reset fix
- SR-IOV fix
- OD table handling fixes
- Fix TA handling on boards without display hardware
- DML1 fix
- ABM fix
- eDP panel fix
- DPPCLK fix
- HDCP fix
- Revert incorrect error case handling in ioremap
- VPE fix
- HDMI fixes
- SDMA 4.4.2 fix
- Other misc fixes
amdkfd:
- Fix duplicate BO handling in process restore"
* tag 'drm-next-2024-03-22' of https://gitlab.freedesktop.org/drm/kernel: (50 commits)
drm/amdgpu/pm: Don't use OD table on Arcturus
drm/amdgpu: drop setting buffer funcs in sdma442
drm/amd/display: Fix noise issue on HDMI AV mute
drm/amd/display: Revert Remove pixle rate limit for subvp
Revert "drm/amdgpu/vpe: don't emit cond exec command under collaborate mode"
Revert "drm/amd/amdgpu: Fix potential ioremap() memory leaks in amdgpu_device_init()"
drm/amd/display: Add a dc_state NULL check in dc_state_release
drm/amd/display: Return the correct HDCP error code
drm/amd/display: Implement wait_for_odm_update_pending_complete
drm/amd/display: Lock all enabled otg pipes even with no planes
drm/amd/display: Amend coasting vtotal for replay low hz
drm/amd/display: Fix idle check for shared firmware state
drm/amd/display: Update odm when ODM combine is changed on an otg master pipe with no plane
drm/amd/display: Init DPPCLK from SMU on dcn32
drm/amd/display: Add monitor patch for specific eDP
drm/amd/display: Allow dirty rects to be sent to dmub when abm is active
drm/amd/display: Override min required DCFCLK in dml1_validate
drm/amdgpu: Bypass display ta if display hw is not available
drm/amdgpu: correct the KGQ fallback message
drm/amdgpu/pm: Check the validity of overdiver power limit
...
Linus Torvalds [Fri, 22 Mar 2024 00:21:41 +0000 (17:21 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Re-instate the CPUMASK_OFFSTACK option for arm64 when NR_CPUS > 256.
The bug that led to the initial revert was the cpufreq-dt code not
using zalloc_cpumask_var().
- Make the STARFIVE_STARLINK_PMU config option depend on 64BIT to
prevent compile-test failures on 32-bit architectures due to missing
writeq().
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
perf: starfive: fix 64-bit only COMPILE_TEST condition
ARM64: Dynamically allocate cpumasks and increase supported CPUs to 512