ceph: handle idmapped mounts in create_request_message()
Inode operations that create a new filesystem object such as ->mknod,
->create, ->mkdir() and others don't take a {g,u}id argument explicitly.
Instead the caller's fs{g,u}id is used for the {g,u}id of the new
filesystem object.
In order to ensure that the correct {g,u}id is used, map the caller's
fs{g,u}id for creation requests. This doesn't require complex changes.
It suffices to pass in the relevant idmapping recorded in the request
message. If this request message was triggered from an inode operation
that creates filesystem objects it will have passed down the relevant
idmapping. If this is a request message that was triggered from an inode
operation that doesn't need to take idmappings into account, the initial
idmapping is passed down which is an identity mapping.
This change uses a new cephfs protocol extension CEPHFS_FEATURE_HAS_OWNER_UIDGID
which adds two new fields (owner_{u,g}id) to the request head structure.
So we need to ensure that the MDS supports it; otherwise we must fail
any IO that comes through an idmapped mount because we can't process it
properly. An MDS without this extension will use the caller_{u,g}id
fields to set the new inode's owner UID/GID, which is incorrect because the
caller_{u,g}id values are unmapped. At the same time we can't map these
fields with an idmapping, as that can break the UID/GID-based permission
checks on the MDS side. This problem is described in detail at [1], [2].
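The behaviour described above can be sketched in userspace C. This is only an illustration, not the cephfs code: every name here (toy_idmap, toy_map_id, toy_request_head, toy_fill_owner) is hypothetical, the idmapping is modelled as a single contiguous extent, and returning -EIO is an assumption about how "fail any IO" is realised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-extent idmapping: IDs [from, from + count) map to
 * [to, to + count). Real kernel idmappings are more general. */
struct toy_idmap {
	uint32_t from;
	uint32_t to;
	uint32_t count;
	bool identity;	/* stands in for the initial (identity) idmapping */
};

/* Map one ID. IDs outside the extent are returned unchanged, which is a
 * simplification made for this sketch. */
static uint32_t toy_map_id(const struct toy_idmap *map, uint32_t id)
{
	assert(map != NULL);
	if (map->identity || id < map->from || id - map->from >= map->count)
		return id;
	return map->to + (id - map->from);
}

/* Hypothetical request head with the two kinds of fields discussed above. */
struct toy_request_head {
	uint32_t caller_uid;	/* left unmapped: used for MDS permission checks */
	uint32_t owner_uid;	/* mapped: ownership of the new object */
};

/* Fill the head for a creation request. If the mount is idmapped and the
 * MDS lacks the owner_{u,g}id extension, the request is refused. */
static int toy_fill_owner(struct toy_request_head *head,
			  const struct toy_idmap *map,
			  uint32_t fsuid, bool mds_has_owner_uidgid)
{
	assert(head != NULL && map != NULL);
	if (!map->identity && !mds_has_owner_uidgid)
		return -EIO;
	head->caller_uid = fsuid;
	head->owner_uid = toy_map_id(map, fsuid);
	return 0;
}
```

Keeping caller_uid unmapped while mapping only owner_uid mirrors the split the message argues for: permission checks and ownership use different fields.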
When sending a mds request cephfs will send relevant data for the
requested operation. For creation requests the caller's fs{g,u}id is
used to set the ownership of the newly created filesystem object. For
setattr requests the caller can pass in arbitrary {g,u}id values to
which the relevant filesystem object is supposed to be changed.
If the caller is performing the relevant operation via an idmapped mount
cephfs simply needs to take the idmapping into account when it sends the
relevant mds request.
In order to support idmapped mounts for cephfs we stash the idmapping
whenever it is relevant for the operation, for the duration of the
request. Since mds requests can be queued and performed asynchronously
we make sure to keep the idmapping around and release it once the
request has finished.
In follow-up patches we will use this to send correct ownership
information over the wire. This patch just adds the basic infrastructure
to keep the idmapping around. The actual conversion patches are all
fairly minimal.
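A minimal sketch of the "keep the idmapping around until the request has finished" idea, assuming a simple reference count. The types and helpers (toy_idmap, toy_idmap_get, toy_idmap_put, toy_mds_request) are hypothetical stand-ins and not the kernel's own structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted idmapping object. */
struct toy_idmap {
	int refcount;
};

static struct toy_idmap *toy_idmap_get(struct toy_idmap *map)
{
	assert(map != NULL && map->refcount > 0);
	map->refcount++;
	return map;
}

static void toy_idmap_put(struct toy_idmap *map)
{
	assert(map != NULL && map->refcount > 0);
	map->refcount--;
}

/* Hypothetical request that may be queued and completed asynchronously. */
struct toy_mds_request {
	struct toy_idmap *idmap;
};

/* Take a reference when the request is set up ... */
static void toy_request_init(struct toy_mds_request *req, struct toy_idmap *map)
{
	assert(req != NULL);
	req->idmap = toy_idmap_get(map);
}

/* ... and drop it only once the request has finished. */
static void toy_request_release(struct toy_mds_request *req)
{
	assert(req != NULL && req->idmap != NULL);
	toy_idmap_put(req->idmap);
	req->idmap = NULL;
}
```

Holding the reference in the request itself means the idmapping stays valid even if the caller has returned before the request completes.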
Xiubo Li [Mon, 12 Jun 2023 01:04:07 +0000 (09:04 +0800)]
ceph: print cluster fsid and client global_id in all debug logs
Multiple CephFS mounts on a host are increasingly common, so
disambiguating messages like this is necessary and will make it easier
to debug issues.
At the same time this improves the debug logs to make troubleshooting
easier, for example by printing the ino# instead of only the memory
address of the corresponding inode, and by printing dentry names
instead of the memory addresses of the dentries.
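As a rough illustration of the kind of log line this aims for, the sketch below prefixes a message with the cluster fsid and client global_id and prints an inode number and dentry name rather than pointers. The function name and the exact output format are assumptions made for this example, not the format the patch uses.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical formatter: "[<fsid> <global_id>] ino <hex ino> dentry '<name>'".
 * Returns what snprintf() returns. */
static int toy_format_debug(char *buf, size_t len, const char *fsid,
			    uint64_t global_id, uint64_t ino,
			    const char *dname)
{
	assert(buf != NULL && fsid != NULL && dname != NULL);
	assert(strlen(fsid) > 0);
	return snprintf(buf, len, "[%s %" PRIu64 "] ino %" PRIx64 " dentry '%s'",
			fsid, global_id, ino, dname);
}
```

With two mounts on one host, the prefix alone is enough to tell which client produced a given line.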
Xiubo Li [Mon, 12 Jun 2023 02:50:38 +0000 (10:50 +0800)]
ceph: rename _to_client() to _to_fs_client()
We need to convert the inode to ceph_client in the following commit
and will add a new helper for that, so here we rename the old helper
to _to_fs_client().
Linus Torvalds [Sat, 28 Oct 2023 18:15:07 +0000 (08:15 -1000)]
Merge tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix a possible CPU hotplug deadlock bug caused by the new TSC
synchronization code
- Fix a legacy PIC discovery bug that results in device troubles on
affected systems, such as non-working keyboards, etc
- Add a new Intel CPU model number to <asm/intel-family.h>
* tag 'x86-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tsc: Defer marking TSC unstable to a worker
x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
x86/cpu: Add model number for Intel Arrow Lake mobile processor
Linus Torvalds [Sat, 28 Oct 2023 18:12:34 +0000 (08:12 -1000)]
Merge tag 'irq-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"Restore unintentionally lost quirk settings in the GIC irqchip driver,
which broke certain devices"
* tag 'irq-urgent-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Don't override quirk settings with default values
Linus Torvalds [Sat, 28 Oct 2023 18:04:56 +0000 (08:04 -1000)]
Merge tag 'probes-fixes-v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- tracing/kprobes: Fix kernel-doc warnings for the variable length
arguments
- tracing/kprobes: Fix to count the symbols in modules even if the
module name is not specified so that user can probe the symbols in
the modules without module name
* tag 'probes-fixes-v6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/kprobes: Fix symbol counting logic by looking at modules as well
tracing/kprobes: Fix the description of variable length arguments
Linus Torvalds [Sat, 28 Oct 2023 17:51:27 +0000 (07:51 -1000)]
Merge tag 'char-misc-6.6-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some very small driver fixes for 6.6-final that have shown up
in the past two weeks. Included in here are:
- tiny fastrpc bugfixes for reported errors
- nvmem register fixes
- iio driver fixes for some reported problems
- fpga test fix
- MAINTAINERS file update for fpga
All of these have been in linux-next this week with no reported
problems"
* tag 'char-misc-6.6-final' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
fpga: Fix memory leak for fpga_region_test_class_find()
fpga: m10bmc-sec: Change contact for secure update driver
fpga: disable KUnit test suites when module support is enabled
iio: afe: rescale: Accept only offset channels
nvmem: imx: correct nregs for i.MX6ULL
nvmem: imx: correct nregs for i.MX6UL
nvmem: imx: correct nregs for i.MX6SLL
misc: fastrpc: Unmap only if buffer is unmapped from DSP
misc: fastrpc: Clean buffers on remote invocation failures
misc: fastrpc: Free DMA handles for RPC calls with no arguments
misc: fastrpc: Reset metadata buffer to avoid incorrect free
iio: exynos-adc: request second interupt only when touchscreen mode is used
iio: adc: xilinx-xadc: Correct temperature offset/scale for UltraScale
iio: adc: xilinx-xadc: Don't clobber preset voltage/temperature thresholds
dt-bindings: iio: add missing reset-gpios constrain
Linus Torvalds [Sat, 28 Oct 2023 17:48:37 +0000 (07:48 -1000)]
Merge tag 'i2c-for-6.6-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Bugfixes for Axxia when it is a target and for PEC handling of
stm32f7.
Plus, fix an OF node leak pattern in the mux subsystem"
* tag 'i2c-for-6.6-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: stm32f7: Fix PEC handling in case of SMBUS transfers
i2c: muxes: i2c-mux-gpmux: Use of_get_i2c_adapter_by_node()
i2c: muxes: i2c-demux-pinctrl: Use of_get_i2c_adapter_by_node()
i2c: muxes: i2c-mux-pinctrl: Use of_get_i2c_adapter_by_node()
i2c: aspeed: Fix i2c bus hang in slave read
Linus Torvalds [Sat, 28 Oct 2023 02:52:51 +0000 (16:52 -1000)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"Three fixes, one for the clk framework and two for clk drivers:
- Avoid an oops in possible_parent_show() by checking for no parent
properly when a DT index based lookup is used
- Handle errors returned from divider_ro_round_rate() in
clk_stm32_composite_determine_rate()
- Fix clk_ops::determine_rate() implementation of socfpga's
gateclk_ops that was ruining uart output because the divider
was forgotten about"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: stm32: Fix a signedness issue in clk_stm32_composite_determine_rate()
clk: Sanitize possible_parent_show to Handle Return Value of of_clk_get_parent_name
clk: socfpga: gate: Account for the divider in determine_rate
Linus Torvalds [Sat, 28 Oct 2023 02:44:58 +0000 (16:44 -1000)]
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc filesystem fixes from Al Viro:
"Assorted fixes all over the place: literally nothing in common, could
have been three separate pull requests.
All are simple regression fixes, but not for anything from this cycle"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock
io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
sparc32: fix a braino in fault handling in csum_and_copy_..._user()
Andrii Nakryiko [Fri, 27 Oct 2023 23:31:26 +0000 (16:31 -0700)]
tracing/kprobes: Fix symbol counting logic by looking at modules as well
Recent changes to count number of matching symbols when creating
a kprobe event failed to take into account kernel modules. As such, it
breaks kprobes on kernel module symbols, by assuming there is no match.
Fix this by calling module_kallsyms_on_each_symbol() in addition to
kallsyms_on_each_match_symbol() to perform a proper counting.
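The counting fix can be pictured with a small userspace sketch: matches are counted in the core symbol table and in the module symbol table, and the two counts are added. The tables and the toy_* helpers are hypothetical; the kernel walks its symbols through callbacks rather than flat arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count how many entries of one symbol table match the given name. */
static unsigned int toy_count_in_table(const char *const *table, size_t nr,
				       const char *name)
{
	unsigned int count = 0;

	assert(name != NULL);
	for (size_t i = 0; i < nr; i++)
		if (strcmp(table[i], name) == 0)
			count++;
	return count;
}

/* Count across both the core kernel and loaded modules. Looking only at
 * the core table would report zero for module-only symbols. */
static unsigned int toy_count_symbols(const char *const *core, size_t nr_core,
				      const char *const *mods, size_t nr_mods,
				      const char *name)
{
	return toy_count_in_table(core, nr_core, name) +
	       toy_count_in_table(mods, nr_mods, name);
}
```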
Al Viro [Mon, 28 Aug 2023 22:47:31 +0000 (18:47 -0400)]
io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed
->ki_pos value is unreliable in such cases. For an obvious example,
consider O_DSYNC write - we feed the data to page cache and start IO,
then we make sure it's completed. Update of ->ki_pos is dealt with
by the first part; failure in the second ends up with negative value
returned _and_ ->ki_pos left advanced as if sync had been successful.
In the same situation write(2) does not advance the file position
at all.
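A hedged sketch of the rule stated above: the position recorded in the kiocb is only propagated when the operation succeeded. The struct and function names are hypothetical and the real io_uring code is structured differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down kiocb: only the position matters here. */
struct toy_kiocb {
	long long ki_pos;
};

/* Decide which file position to publish once the I/O has completed.
 * On failure ->ki_pos may already have been advanced, so it is ignored
 * and the position from before the call is kept, as write(2) does. */
static long long toy_done_pos(const struct toy_kiocb *kiocb,
			      long long start_pos, long ret)
{
	assert(kiocb != NULL);
	if (ret < 0)
		return start_pos;
	return kiocb->ki_pos;
}
```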
Linus Torvalds [Sat, 28 Oct 2023 00:10:32 +0000 (14:10 -1000)]
Merge tag 'io_uring-6.6-2023-10-27' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Fix for an issue reported where reading fdinfo could find a NULL
thread as we didn't properly synchronize, and then a disable for the
IOCB_DIO_CALLER_COMP optimization as a recent report highlighted how
that could lead to deadlocks if the task issued async O_DIRECT writes
and then proceeded to do sync fallocate() calls"
* tag 'io_uring-6.6-2023-10-27' of git://git.kernel.dk/linux:
io_uring/rw: disable IOCB_DIO_CALLER_COMP
io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
Al Viro [Sun, 22 Oct 2023 23:34:28 +0000 (19:34 -0400)]
sparc32: fix a braino in fault handling in csum_and_copy_..._user()
Fault handler used to make non-trivial calls, so it needed
to set a stack frame up. Used to be
	save ... - grab a stack frame, old %o... become %i...
	....
	ret	- go back to address originally in %o7, currently %i7
	 restore - switch to previous stack frame, in delay slot
Non-trivial calls had been gone since ab5e8b331244 and that code should
have become
	retl	- go back to address in %o7
	 clr %o0 - have return value set to 0
What it had become instead was
	ret	- go back to address in %i7 - return address of *caller*
	 clr %o0 - have return value set to 0
which is not good, to put it mildly - we forcibly return 0 from
csum_and_copy_{from,to}_iter() (which is what the call of that
thing had been inlined into) and do that without dropping the
stack frame of said csum_and_copy_..._iter(). Confuses the
hell out of the caller of csum_and_copy_..._iter(), obviously...
Reviewed-by: Sam Ravnborg <[email protected]>
Fixes: ab5e8b331244 "sparc32: propagate the calling conventions change down to __csum_partial_copy_sparc_generic()"
Signed-off-by: Al Viro <[email protected]>
Linus Torvalds [Fri, 27 Oct 2023 23:38:59 +0000 (13:38 -1000)]
Merge tag 'ata-6.6-final' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA fix from Damien Le Moal:
"A single patch to fix a regression introduced by the recent
suspend/resume fixes.
The regression is that ATA disks are not stopped on system shutdown,
which is not recommended and increases the disks' SMART counters for
unclean power off events.
This patch fixes this by refining the recent rework of the scsi device
manage_xxx flags"
* tag 'ata-6.6-final' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
scsi: sd: Introduce manage_shutdown device flag
Linus Torvalds [Fri, 27 Oct 2023 23:32:48 +0000 (13:32 -1000)]
Merge tag 'platform-drivers-x86-v6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fix from Hans de Goede:
"A single patch to extend the AMD PMC driver DMI quirk list
for laptops which need special handling to avoid NVME s2idle
suspend/resume errors"
* tag 'platform-drivers-x86-v6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: Add s2idle quirk for more Lenovo laptops
The reason is the recent conversion of the TSC synchronization function
during CPU hotplug on the control CPU to a SMP function call. In case
that the synchronization with the upcoming CPU fails, the TSC has to be
marked unstable via clocksource_mark_unstable().
clocksource_mark_unstable() acquires 'watchdog_lock', but that lock is
taken with interrupts enabled in the watchdog timer callback to minimize
interrupt disabled time. That's obviously a possible deadlock scenario.
Before that change the synchronization function was invoked in thread
context so this could not happen.
As it is not crucial whether the unstable marking happens slightly
delayed, defer the call to a worker thread which avoids the lock context
problem.
Thomas Gleixner [Wed, 25 Oct 2023 21:04:15 +0000 (23:04 +0200)]
x86/i8259: Skip probing when ACPI/MADT advertises PCAT compatibility
David and a few others reported that on certain newer systems some legacy
interrupts fail to work correctly.
Debugging revealed that the BIOS of these systems leaves the legacy PIC in
uninitialized state which makes the PIC detection fail and the kernel
switches to a dummy implementation.
Unfortunately this fallback causes quite some code to fail as it depends on
checks for the number of legacy PIC interrupts or the availability of the
real PIC.
In theory there is no reason to use the PIC on any modern system when
IO/APIC is available, but the dependencies on the related checks cannot be
resolved trivially and on short notice. This needs lots of analysis and
rework.
The PIC detection has been added to avoid quirky checks and force selection
of the dummy implementation all over the place, especially in VM guest
scenarios. So it's not an option to revert the relevant commit as that
would break a lot of other scenarios.
One solution would be to try to initialize the PIC on detection fail and
retry the detection, but that puts the burden on everything which does not
have a PIC.
Fortunately the ACPI/MADT table header has a flag field, which advertises
in bit 0 that the system is PCAT compatible, which means it has a legacy
8259 PIC.
Evaluate that bit and if set avoid the detection routine and keep the real
PIC installed, which then gets initialized (for nothing) and makes the rest
of the code with all the dependencies work again.
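The decision described above can be sketched as follows. The flag macro, the enum and the function are hypothetical names for this example; only the meaning of bit 0 of the MADT flags field (PCAT compatible, so a legacy 8259 PIC exists) is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the MADT flags field: system is PCAT compatible.
 * The macro name is made up for this sketch. */
#define TOY_MADT_PCAT_COMPAT	(1u << 0)

enum toy_pic {
	TOY_PIC_REAL,	/* keep the real 8259 implementation */
	TOY_PIC_DUMMY,	/* fall back to the dummy implementation */
};

/* If firmware advertises PCAT compatibility, skip the probe and keep the
 * real PIC. Otherwise the probe result decides, as before. */
static enum toy_pic toy_select_pic(uint32_t madt_flags, bool probe_ok)
{
	if (madt_flags & TOY_MADT_PCAT_COMPAT)
		return TOY_PIC_REAL;
	return probe_ok ? TOY_PIC_REAL : TOY_PIC_DUMMY;
}
```

The interesting case is the first one: a BIOS that leaves the PIC uninitialized makes the probe fail, yet the flag still keeps the real PIC installed.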
Linus Torvalds [Fri, 27 Oct 2023 15:40:42 +0000 (05:40 -1000)]
Merge tag 'powerpc-6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix boot crash with FLATMEM since set_ptes() introduction
- Avoid calling arch_enter/leave_lazy_mmu() in set_ptes()
Thanks to Aneesh Kumar K.V and Erhard Furtner.
* tag 'powerpc-6.6-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
powerpc/mm: Fix boot crash with FLATMEM
David Lazar [Wed, 25 Oct 2023 19:30:16 +0000 (21:30 +0200)]
platform/x86: Add s2idle quirk for more Lenovo laptops
When suspending to idle and resuming on some Lenovo laptops using the
Mendocino APU, multiple NVME IOMMU page faults occur, showing up in
dmesg as repeated errors:
Applying the s2idle quirk introduced by commit 455cd867b85b ("platform/x86:
thinkpad_acpi: Add a s2idle resume quirk for a number of laptops")
allows these systems to work with the IOMMU enabled and s2idle
resume to work.
Yujie Liu [Fri, 27 Oct 2023 04:13:14 +0000 (12:13 +0800)]
tracing/kprobes: Fix the description of variable length arguments
Fix the following kernel-doc warnings:
kernel/trace/trace_kprobe.c:1029: warning: Excess function parameter 'args' description in '__kprobe_event_gen_cmd_start'
kernel/trace/trace_kprobe.c:1097: warning: Excess function parameter 'args' description in '__kprobe_event_add_fields'
Refer to the usage of variable length arguments elsewhere in the kernel
code, "@..." is the proper way to express it in the description.
Lu Baolu [Thu, 26 Oct 2023 08:49:42 +0000 (16:49 +0800)]
iommu: Avoid unnecessary cache invalidations
The iommu_create_device_direct_mappings() only needs to flush the caches
when the mappings are changed in the affected domain. This is not true
for non-DMA domains, or for devices attached to the domain that have no
reserved regions. To avoid unnecessary cache invalidations, add a check
before iommu_flush_iotlb_all().
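A small sketch of the added check, assuming a flag that records whether any mapping was actually created. The structures and the flush counter are hypothetical and only model the control flow described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reserved region; only "direct" regions get mapped. */
struct toy_region {
	unsigned long start;
	unsigned long length;
	bool direct;
};

/* Hypothetical domain with counters so the effect can be observed. */
struct toy_domain {
	bool is_dma;
	unsigned int mapped;
	unsigned int flushes;
};

static void toy_create_direct_mappings(struct toy_domain *domain,
				       const struct toy_region *regions,
				       size_t nr)
{
	bool changed = false;

	assert(domain != NULL);
	assert(regions != NULL || nr == 0);

	if (!domain->is_dma)
		return;		/* non-DMA domain: nothing is mapped here */

	for (size_t i = 0; i < nr; i++) {
		if (!regions[i].direct)
			continue;
		domain->mapped++;
		changed = true;
	}

	/* Flush only if the mappings were really changed. */
	if (changed)
		domain->flushes++;
}
```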
Linus Torvalds [Fri, 27 Oct 2023 06:42:02 +0000 (20:42 -1000)]
Merge tag 'drm-fixes-2023-10-27' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"This is the final set of fixes for 6.6, just misc bits mainly in
amdgpu and i915, nothing too noteworthy.
amdgpu:
- ignore duplicated BOs in CS parser
- remove redundant call to amdgpu_ctx_priority_is_valid()
- Extend VI APSM quirks to more platforms
amdkfd:
- reserve fence slot while locking BO
dp_mst:
- Fix NULL deref in get_mst_branch_device_by_guid_helper()
logicvc:
- Kconfig: Select REGMAP and REGMAP_MMIO
ivpu:
- Fix missing VPUIP interrupts
i915:
- Determine context valid in OA reports
- Hold GT forcewake during steering operations
- Check if PMU is closed before stopping event"
* tag 'drm-fixes-2023-10-27' of git://anongit.freedesktop.org/drm/drm:
accel/ivpu/37xx: Fix missing VPUIP interrupts
drm/amd: Disable ASPM for VI w/ all Intel systems
drm/i915/pmu: Check if pmu is closed before stopping event
drm/i915/mcr: Hold GT forcewake during steering operations
drm/logicvc: Kconfig: select REGMAP and REGMAP_MMIO
drm/i915/perf: Determine context valid in OA reports
drm/amdkfd: reserve a fence slot while locking the BO
drm/amdgpu: Remove redundant call to priority_is_valid()
drm/dp_mst: Fix NULL deref in get_mst_branch_device_by_guid_helper()
drm/amdgpu: ignore duplicate BOs again
Dave Airlie [Fri, 27 Oct 2023 01:58:28 +0000 (11:58 +1000)]
Merge tag 'drm-intel-fixes-2023-10-26' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Determine context valid in OA reports (Umesh)
- Hold GT forcewake during steering operations (Matt Roper)
- Check if PMU is closed before stopping event (Umesh)
Damien Le Moal [Wed, 25 Oct 2023 06:46:12 +0000 (15:46 +0900)]
scsi: sd: Introduce manage_shutdown device flag
Commit aa3998dbeb3a ("ata: libata-scsi: Disable scsi device
manage_system_start_stop") changed the manage_system_start_stop
flag to false for libata managed disks to enable libata internal
management of disk suspend/resume. However, a side effect of this change
is that on system shutdown, disks are no longer being stopped (set to
standby mode with the heads unloaded). While this is not a critical
issue, this unclean shutdown is not recommended and shows up with
increased SMART counters (e.g. the unexpected power loss counter
"Unexpect_Power_Loss_Ct").
Instead of defining a shutdown driver method for all ATA adapter
drivers (not all of them define that operation), this patch resolves
this issue by further refining the sd driver start/stop control of disks
using the new flag manage_shutdown. If this new flag is set to true by
a low level driver, the function sd_shutdown() will issue a
START STOP UNIT command with the start argument set to 0 when a disk
needs to be powered off (suspended) on system power off, that is, when
system_state is equal to SYSTEM_POWER_OFF.
Similarly to the other manage_xxx flags, the new manage_shutdown flag is
exposed through sysfs as a read-write device attribute.
To avoid any confusion between manage_shutdown and
manage_system_start_stop, the comments describing these flags in
include/scsi/scsi.h are also improved.
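The condition described for sd_shutdown() can be sketched like this. It models only what the message states (stop the disk when the new flag is set and the system is powering off); the enum, struct and helper are hypothetical and any other conditions in the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of system states relevant to this sketch. */
enum toy_system_state {
	TOY_SYSTEM_RUNNING,
	TOY_SYSTEM_RESTART,
	TOY_SYSTEM_POWER_OFF,
};

/* Hypothetical device with the manage_xxx flags mentioned in the text. */
struct toy_scsi_device {
	bool manage_system_start_stop;
	bool manage_shutdown;
};

/* True when a START STOP UNIT command with start = 0 should be issued
 * at shutdown time. */
static bool toy_shutdown_should_stop(const struct toy_scsi_device *sdev,
				     enum toy_system_state state)
{
	assert(sdev != NULL);
	return state == TOY_SYSTEM_POWER_OFF && sdev->manage_shutdown;
}
```

A libata-managed disk would then have manage_system_start_stop false and manage_shutdown true, so it is still stopped on power off.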
Linus Torvalds [Thu, 26 Oct 2023 18:17:26 +0000 (08:17 -1000)]
Merge tag 'soc-fixes-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"A couple of platforms have some last-minute fixes, in particular:
- riscv gets some fixes for noncoherent DMA on the renesas and thead
platforms and dts fix for SPI on the visionfive 2 board
- Qualcomm Snapdragon gets three dts fixes to address board specific
regressions on the pmic and gpio nodes
- Rockchip platforms get multiple dts fixes to address issues on the
recent rk3399 platform as well as the older rk3128 platform that
apparently regressed a while ago.
- TI OMAP gets some trivial code and dts fixes and a regression fix
for the omap1 ams-delta modem
- NXP i.MX firmware has one fix for a use-after-free bug in its error
handling"
* tag 'soc-fixes-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (25 commits)
soc: renesas: ARCH_R9A07G043 depends on !RISCV_ISA_ZICBOM
riscv: only select DMA_DIRECT_REMAP from RISCV_ISA_ZICBOM and ERRATA_THEAD_PBMT
riscv: RISCV_NONSTANDARD_CACHE_OPS shouldn't depend on RISCV_DMA_NONCOHERENT
riscv: dts: thead: set dma-noncoherent to soc bus
arm64: dts: rockchip: Fix i2s0 pin conflict on ROCK Pi 4 boards
arm64: dts: rockchip: Add i2s0-2ch-bus-bclk-off pins to RK3399
clk: ti: Fix missing omap5 mcbsp functional clock and aliases
clk: ti: Fix missing omap4 mcbsp functional clock and aliases
ARM: OMAP1: ams-delta: Fix MODEM initialization failure
soc: renesas: Make ARCH_R9A07G043 depend on required options
riscv: dts: starfive: visionfive 2: correct spi's ss pin
firmware/imx-dsp: Fix use_after_free in imx_dsp_setup_channels()
ARM: OMAP: timer32K: fix all kernel-doc warnings
ARM: omap2: fix a debug printk
ARM: dts: rockchip: Fix timer clocks for RK3128
ARM: dts: rockchip: Add missing quirk for RK3128's dma engine
ARM: dts: rockchip: Add missing arm timer interrupt for RK3128
ARM: dts: rockchip: Fix i2c0 register address for RK3128
arm64: dts: rockchip: set codec system-clock-fixed on px30-ringneck-haikou
arm64: dts: rockchip: use codec as clock master on px30-ringneck-haikou
...
Linus Torvalds [Thu, 26 Oct 2023 17:41:27 +0000 (07:41 -1000)]
Merge tag 'net-6.6-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from WiFi and netfilter.
Most regressions addressed here come from quite old versions, with the
exceptions of the iavf one and the WiFi fixes. No known outstanding
reports or investigation.
Fixes to fixes:
- eth: iavf: in iavf_down, disable queues when removing the driver
Previous releases - regressions:
- sched: act_ct: additional checks for outdated flows
- tcp: do not leave an empty skb in write queue
- tcp: fix wrong RTO timeout when received SACK reneging
- wifi: cfg80211: pass correct pointer to rdev_inform_bss()
- eth: i40e: sync next_to_clean and next_to_process for programming
status desc
- eth: iavf: initialize waitqueues before starting watchdog_task
Previous releases - always broken:
- eth: r8169: fix data-races
- eth: igb: fix potential memory leak in igb_add_ethtool_nfc_entry
- eth: r8152: avoid writing garbage to the adapter's registers
- eth: gtp: fix fragmentation needed check with gso"
* tag 'net-6.6-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (43 commits)
iavf: in iavf_down, disable queues when removing the driver
vsock/virtio: initialize the_virtio_vsock before using VQs
net: ipv6: fix typo in comments
net: ipv4: fix typo in comments
net/sched: act_ct: additional checks for outdated flows
netfilter: flowtable: GC pushes back packets to classic path
i40e: Fix wrong check for I40E_TXR_FLAGS_WB_ON_ITR
gtp: fix fragmentation needed check with gso
gtp: uapi: fix GTPA_MAX
Fix NULL pointer dereference in cn_filter()
sfc: cleanup and reduce netlink error messages
net/handshake: fix file ref count in handshake_nl_accept_doit()
wifi: mac80211: don't drop all unprotected public action frames
wifi: cfg80211: fix assoc response warning on failed links
wifi: cfg80211: pass correct pointer to rdev_inform_bss()
isdn: mISDN: hfcsusb: Spelling fix in comment
tcp: fix wrong RTO timeout when received SACK reneging
r8152: Block future register access if register access fails
r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
r8152: Check for unplug in r8153b_ups_en() / r8153c_ups_en()
...
Arnd Bergmann [Thu, 26 Oct 2023 15:06:37 +0000 (17:06 +0200)]
Merge tag 'renesas-fixes-for-v6.6-tag3' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel into arm/fixes
Renesas fixes for v6.6 (take three)
- Sort out a few Kconfig dependency issues for the rich set of RISC-V
non-coherent DMA support.
* tag 'renesas-fixes-for-v6.6-tag3' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel:
soc: renesas: ARCH_R9A07G043 depends on !RISCV_ISA_ZICBOM
riscv: only select DMA_DIRECT_REMAP from RISCV_ISA_ZICBOM and ERRATA_THEAD_PBMT
riscv: RISCV_NONSTANDARD_CACHE_OPS shouldn't depend on RISCV_DMA_NONCOHERENT
soc: renesas: ARCH_R9A07G043 depends on !RISCV_ISA_ZICBOM
ARCH_R9A07G043 has its own non-standard global pool based DMA coherent
allocator, which conflicts with the remap based RISCV_ISA_ZICBOM version.
Add a proper dependency.
Karol Wachowski [Tue, 24 Oct 2023 16:19:52 +0000 (18:19 +0200)]
accel/ivpu/37xx: Fix missing VPUIP interrupts
Move sequence of masking and unmasking global interrupts from buttress
interrupt handler to generic one that handles both VPUIP and BTRS
interrupts. Unmasking global interrupts will re-trigger MSI for any
pending interrupts.
Lack of this sequence will cause the driver to miss any
VPUIP interrupt that comes after reading VPU_37XX_HOST_SS_ICB_STATUS_0
and before clearing all active interrupt sources.
Michal Schmidt [Wed, 25 Oct 2023 18:32:13 +0000 (11:32 -0700)]
iavf: in iavf_down, disable queues when removing the driver
In iavf_down, we're skipping the scheduling of certain operations if
the driver is being removed. However, the IAVF_FLAG_AQ_DISABLE_QUEUES
request must not be skipped in this case, because iavf_close waits
for the transition to the __IAVF_DOWN state, which happens in
iavf_virtchnl_completion after the queues are released.
Without this fix, "rmmod iavf" takes half a second per interface that's
up and prints the "Device resources not yet released" warning.
Jakub Kicinski [Wed, 25 Oct 2023 23:02:06 +0000 (16:02 -0700)]
Merge tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
This patch series contains two late Netfilter flowtable fixes for net:
1) Flowtable GC pushes back packets to classic path in every GC run,
ie. every second. This is because NF_FLOW_HW_ESTABLISHED is only
used by sched/act_ct (never set) and IPS_SEEN_REPLY might be unset
by the time the flow is offloaded (this status bit is only reliable
in the sched/act_ct datapath).
2) sched/act_ct logic to push back packets to classic path to reevaluate
if UDP flow is unidirectional only applies if IPS_HW_OFFLOAD_BIT is
set on and no hardware offload request is pending to be handled.
From Vlad Buslov.
These two patches fix two problems that were introduced in the
previous 6.5 development cycle.
* tag 'nf-23-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
net/sched: act_ct: additional checks for outdated flows
netfilter: flowtable: GC pushes back packets to classic path
====================
Alexandru Matei [Tue, 24 Oct 2023 19:17:42 +0000 (22:17 +0300)]
vsock/virtio: initialize the_virtio_vsock before using VQs
Once VQs are filled with empty buffers and we kick the host, it can send
connection requests. If the_virtio_vsock is not initialized before,
replies are silently dropped and do not reach the host.
virtio_transport_send_pkt() can queue packets once the_virtio_vsock is
set, but they won't be processed until vsock->tx_run is set to true. We
queue vsock->send_pkt_work when initialization finishes to send those
packets queued earlier.
Marc Zyngier [Tue, 24 Oct 2023 14:34:31 +0000 (15:34 +0100)]
irqchip/gic-v3-its: Don't override quirk settings with default values
When splitting the allocation of the ITS node from its configuration,
some of the default settings were kept in the latter instead of
being moved to the former.
This has the side effect of negating some of the quirk detections that
have happened in between, amongst which the dreaded Synquacer hack
(that also affects Dominic's TI platform).
Move the initialisation of these fields early, so that they can again be
overridden by the Synquacer quirk.
Petr Tesarik [Wed, 25 Oct 2023 08:44:25 +0000 (10:44 +0200)]
swiotlb: do not try to allocate a TLB bigger than MAX_ORDER pages
When allocating a new pool at runtime, reduce the number of slabs so
that the allocation order is at most MAX_ORDER. This avoids a kernel
warning in __alloc_pages().
The warning is relatively benign, because the pool size is subsequently
reduced when allocation fails, but it is silly to start with a request
that is known to fail, especially since this is the default behavior if
the kernel is built with CONFIG_SWIOTLB_DYNAMIC=y and booted without any
swiotlb= parameter.
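The clamping can be sketched with a few lines of arithmetic. The constants below (4 KiB pages, 2 KiB slabs, a maximum order of 10) and all toy_* names are assumptions chosen for the example and depend on the kernel configuration in practice.

```c
#include <assert.h>

/* Assumed values for this sketch only. */
#define TOY_PAGE_SHIFT		12	/* 4 KiB pages */
#define TOY_IO_TLB_SHIFT	11	/* 2 KiB slabs */
#define TOY_MAX_ORDER		10
#define TOY_SLABS_PER_PAGE	(1UL << (TOY_PAGE_SHIFT - TOY_IO_TLB_SHIFT))

/* Smallest order such that (page size << order) covers the given size. */
static unsigned int toy_get_order(unsigned long long bytes)
{
	unsigned int order = 0;

	while ((1ULL << (TOY_PAGE_SHIFT + order)) < bytes)
		order++;
	return order;
}

/* Reduce the slab count so the allocation order is at most TOY_MAX_ORDER. */
static unsigned long toy_clamp_nslabs(unsigned long nslabs)
{
	unsigned long long bytes;

	assert(nslabs <= (1UL << 30));	/* keeps the shifts in range */
	bytes = (unsigned long long)nslabs << TOY_IO_TLB_SHIFT;
	if (toy_get_order(bytes) > TOY_MAX_ORDER)
		nslabs = TOY_SLABS_PER_PAGE << TOY_MAX_ORDER;
	return nslabs;
}
```

Under these assumptions the largest pool is 2048 slabs (4 MiB), so a request that would need a higher order is reduced up front instead of failing in the page allocator first.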
Reported-by: Ben Greear <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Fixes: 1aaa736815eb ("swiotlb: allocate a new memory pool when existing pools are full")
Signed-off-by: Petr Tesarik <[email protected]>
Signed-off-by: Christoph Hellwig <[email protected]>
Jens Axboe [Tue, 24 Oct 2023 20:39:06 +0000 (14:39 -0600)]
io_uring/rw: disable IOCB_DIO_CALLER_COMP
If an application does O_DIRECT writes with io_uring and the file system
supports IOCB_DIO_CALLER_COMP, then completion of the dio write side is
done from the task_work that will post the completion event for said
write as well.
Whenever a dio write is done against a file, the inode i_dio_count is
elevated. This enables other callers to use inode_dio_wait() to wait for
previous writes to complete. If we defer the full dio completion to
task_work, we are dependent on that task_work being run before the
inode i_dio_count can be decremented.
If the same task that issues io_uring dio writes with
IOCB_DIO_CALLER_COMP performs a synchronous system call that calls
inode_dio_wait(), then we can deadlock as we're blocked sleeping on
the event to become true, but not processing the completions that will
result in the inode i_dio_count being decremented.
Until we can guarantee that this cannot happen, disable the deferred
caller completions.
Fixes: 099ada2c8726 ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP")
Reported-by: Andres Freund <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
Jens Axboe [Sat, 21 Oct 2023 18:30:29 +0000 (12:30 -0600)]
io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid
We could race with SQ thread exit, and if we do, we'll hit a NULL pointer
dereference when the thread is cleared. Grab the SQPOLL data lock before
attempting to get the task cpu and pid for fdinfo, this ensures we have a
stable view of it.
drm/i915/pmu: Check if pmu is closed before stopping event
When the driver unbinds, pmu is unregistered and i915->uabi_engines is
set to RB_ROOT. Due to this, when i915 PMU tries to stop the engine
events, it issues a warn_on because engine lookup fails.
All perf hooks are taking care of this using a pmu->closed flag that is
set when PMU unregisters. The stop event seems to have been left out.
Check for pmu->closed in pmu_event_stop as well.
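A minimal sketch of that guard, written as a standalone model. The struct and function names (toy_pmu, toy_event_stop) are hypothetical stand-ins and do not mirror the i915 definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's PMU state; not the i915 struct. */
struct toy_pmu {
	bool closed;          /* set once the PMU has been unregistered */
	int *engine_samples;  /* lookup data that is torn down on unbind */
	int stops_handled;    /* how many stop requests did real work */
};

/* Returns true if the stop was processed, false if it was skipped. */
static bool toy_event_stop(struct toy_pmu *pmu)
{
	assert(pmu != NULL);

	/* Same early exit the other perf hooks are described as using. */
	if (pmu->closed)
		return false;

	/* Engine state may only be touched while the PMU is still open. */
	assert(pmu->engine_samples != NULL);
	pmu->stops_handled++;
	return true;
}
```

With the check in place, a stop request that arrives after unbind returns early instead of attempting a lookup against state that no longer exists.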
Based on discussion here -
https://patchwork.freedesktop.org/patch/492079/?series=105790&rev=2
v2: s/is/if/ in commit title
v3: Add fixes tag and cc stable
Matt Roper [Thu, 19 Oct 2023 17:02:42 +0000 (10:02 -0700)]
drm/i915/mcr: Hold GT forcewake during steering operations
The steering control and semaphore registers are inside an "always on"
power domain with respect to RC6. However there are some issues if
higher-level platform sleep states are entering/exiting at the same time
these registers are accessed. Grabbing GT forcewake and holding it over
the entire lock/steer/unlock cycle ensures that those sleep states have
been fully exited before we access these registers.
This is expected to become a formally documented/numbered workaround
soon.
Note that this patch alone isn't expected to have an immediately
noticeable impact on MCR (mis)behavior; an upcoming pcode firmware
update will also be necessary to provide the other half of this
workaround.
v2:
- Move the forcewake inside the Xe_LPG-specific IP version check. This
should only be necessary on platforms that have a steering semaphore.
Sui Jingfeng [Thu, 8 Jun 2023 02:42:07 +0000 (10:42 +0800)]
drm/logicvc: Kconfig: select REGMAP and REGMAP_MMIO
The drm/logicvc driver depends on REGMAP and REGMAP_MMIO and should select
these two Kconfig options; otherwise the driver fails to compile on platforms
without REGMAP_MMIO selected:
Vlad Buslov [Tue, 24 Oct 2023 19:58:57 +0000 (21:58 +0200)]
net/sched: act_ct: additional checks for outdated flows
The current nf_flow_is_outdated() implementation considers any flow table flow
whose state has diverged from its underlying CT connection status for teardown,
which can be problematic in the following cases:
- Flow has never been offloaded to hardware in the first place, either
because the flow table has hardware offload disabled (flag
NF_FLOWTABLE_HW_OFFLOAD is not set) or because it is still pending on the 'add'
workqueue to be offloaded for the first time. The former is incorrect; the
latter generates excessive deletions and additions of flows.
- Flow is already pending to be updated on the workqueue. Tearing down such
flows will also generate excessive removals from the flow table, especially
on a highly loaded system where the latency to re-offload a flow via the 'add'
workqueue can be quite high.
When considering a flow for teardown as outdated, verify that it is both
offloaded to hardware and doesn't have any pending updates.
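The extra conditions can be sketched as a predicate over a toy flow structure. The flag names and values below are hypothetical, and the "diverged" test is an assumption made for illustration; the kernel's own bits and checks may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch only. */
enum toy_flow_flags {
	TOY_FLOW_HW             = 1u << 0, /* flow was offloaded to hardware */
	TOY_FLOW_HW_PENDING     = 1u << 1, /* an add/update is still queued */
	TOY_FLOW_HW_ESTABLISHED = 1u << 2, /* offloaded as established */
};

struct toy_flow {
	unsigned int flags;
	bool ct_seen_reply; /* conntrack saw traffic in both directions */
};

static bool toy_flow_is_outdated(const struct toy_flow *flow)
{
	assert(flow != NULL);

	/* Assumed divergence: CT is bidirectional, the offloaded flow is not. */
	bool diverged = flow->ct_seen_reply &&
			!(flow->flags & TOY_FLOW_HW_ESTABLISHED);

	/* New conditions: really in hardware and nothing pending. */
	return diverged &&
	       (flow->flags & TOY_FLOW_HW) &&
	       !(flow->flags & TOY_FLOW_HW_PENDING);
}
```

A flow that was never offloaded, or that still has work queued, is left alone in this model, which avoids the repeated delete/add cycles described in the two cases above.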
Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
Reviewed-by: Paul Blakey <[email protected]>
Signed-off-by: Vlad Buslov <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
netfilter: flowtable: GC pushes back packets to classic path
Since 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded
unreplied tuple"), flowtable GC pushes flows with IPS_SEEN_REPLY
back to the classic path in every run, i.e. every second. This is because of
a new check for NF_FLOW_HW_ESTABLISHED, which is specific to sched/act_ct.
In Netfilter's flowtable case, NF_FLOW_HW_ESTABLISHED never gets set, and
IPS_SEEN_REPLY is unreliable since users decide when to offload the
flow, so this bit might only be set at a later stage.
Fix it by adding a custom .gc handler that sched/act_ct can use to
deal with its NF_FLOW_HW_ESTABLISHED bit.
Fixes: 41f2c7c342d3 ("net/sched: act_ct: Fix promotion of offloaded unreplied tuple")
Reported-by: Vladimir Smelhaus <[email protected]>
Reviewed-by: Paul Blakey <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
With commit 9fee28baa601 ("powerpc: implement the new page table range
API") we added set_ptes to the powerpc architecture. The implementation
included calls to arch_enter/leave_lazy_mmu().
The patch removes the usage of arch_enter/leave_lazy_mmu() because
set_pte is not supposed to be used when updating a pte entry. The powerpc
architecture uses this rule to skip the expensive tlb invalidate, which
is not needed when you are setting up the pte for the first time. See
commit 56eecdb912b5 ("mm: Use ptep/pmdp_set_numa() for updating
_PAGE_NUMA bit") for more details.
The patch also makes sure we are not using the interface to update a
valid/present pte entry by adding a VM_WARN_ON check for all the ptes we
are setting up. Furthermore, we add a comment to set_pte_filter to
clarify it can only update folio-related flags and cannot filter
pfn specific details in pte filtering.
Removal of arch_enter/leave_lazy_mmu() will also avoid nesting of
these functions, which is not supported. For example:
Jakub Kicinski [Tue, 24 Oct 2023 20:10:53 +0000 (13:10 -0700)]
Merge tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Johannes Berg says:
====================
Three more fixes:
- don't drop all unprotected public action frames since
some don't have a protected dual
- fix pointer confusion in scanning code
- fix warning in some connections with multiple links
* tag 'wireless-2023-10-24' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: don't drop all unprotected public action frames
wifi: cfg80211: fix assoc response warning on failed links
wifi: cfg80211: pass correct pointer to rdev_inform_bss()
====================
Linus Torvalds [Tue, 24 Oct 2023 19:52:16 +0000 (09:52 -1000)]
Merge tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 12 are cc:stable and the remainder address post-6.5
issues or aren't considered necessary for earlier kernel versions"
* tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
maple_tree: add GFP_KERNEL to allocations in mas_expected_entries()
selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier
mailmap: correct email aliasing for Oleksij Rempel
mailmap: map Bartosz's old address to the current one
mm/damon/sysfs: check DAMOS regions update progress from before_terminate()
MAINTAINERS: Ondrej has moved
kasan: disable kasan_non_canonical_hook() for HW tags
kasan: print the original fault addr when access invalid shadow
hugetlbfs: close race between MADV_DONTNEED and page fault
hugetlbfs: extend hugetlb_vma_lock to private VMAs
hugetlbfs: clear resv_map pointer if mmap fails
mm: zswap: fix pool refcount bug around shrink_worker()
mm/migrate: fix do_pages_move for compat pointers
riscv: fix set_huge_pte_at() for NAPOT mappings when a swap entry is set
riscv: handle VM_FAULT_[HWPOISON|HWPOISON_LARGE] faults instead of panicking
mmap: fix error paths with dup_anon_vma()
mmap: fix vma_iterator in error path of vma_merge()
mm: fix vm_brk_flags() to not bail out while holding lock
mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer
mm/page_alloc: correct start page when guard page debug is enabled
Jinjie Ruan [Mon, 23 Oct 2023 03:28:57 +0000 (11:28 +0800)]
fpga: Fix memory leak for fpga_region_test_class_find()
fpga_region_class_find() in fpga_region_test_class_find() calls
get_device() if the data is matched, which increments the refcount for
dev->kobj. It should therefore call put_device() to decrement the refcount
for dev->kobj so that the region can be freed: fpga_region_unregister()
calls fpga_region_dev_release() only when the refcount for dev->kobj is
zero, but fpga_region_test_init() calls device_register() in
fpga_region_register_full(), which also increments the refcount.
So call put_device() after calling fpga_region_class_find() in
fpga_region_test_class_find(). After applying this patch, the following
memory leak is no longer detected.
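The reference accounting can be illustrated with a toy refcount model. All names here are hypothetical stand-ins for the driver-core helpers, and registration is simplified to "the device starts with one reference".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_device {
	int refcount;
	bool released;   /* set when the release callback would run */
};

static void toy_get(struct toy_device *dev)
{
	dev->refcount++;
}

static void toy_put(struct toy_device *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->released = true;   /* release runs only at zero */
}

/* A successful lookup hands back the device with an extra reference. */
static struct toy_device *toy_class_find(struct toy_device *dev, bool match)
{
	if (!match)
		return NULL;
	toy_get(dev);
	return dev;
}
```

In this model, a test that calls toy_class_find() and never calls toy_put() on the result leaves the count at one after unregistering, so the release step never runs. Pairing every successful lookup with a put restores the balance.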
Russ Weight [Mon, 23 Oct 2023 03:28:56 +0000 (11:28 +0800)]
fpga: m10bmc-sec: Change contact for secure update driver
Change the maintainer for the Intel MAX10 BMC Secure Update driver from
Russ Weight to Peter Colberg. Update the ABI documentation contact
information as well.
drm/i915/perf: Determine context valid in OA reports
When supporting OA for TGL, it was seen that the context valid bit in
the report ID was not defined; however, on revisiting the spec, this bit
does appear to be defined. The bit is used to determine if a context is valid on
a context switch and is essential to determine active and idle periods
for a context. Re-enable the context valid bit for gen12 platforms.
Paolo Abeni [Tue, 24 Oct 2023 10:02:03 +0000 (12:02 +0200)]
Merge branch 'gtp-tunnel-driver-fixes'
Pablo Neira Ayuso says:
====================
GTP tunnel driver fixes
The following patchset contains two fixes for the GTP tunnel driver:
1) Incorrect GTPA_MAX definition in UAPI headers. This updates an
existing UAPI definition, but for a good reason: the current definition
is certainly broken. Similar fixes for incorrect _MAX definitions in netlink
headers were applied in the past too.
2) Fix GTP driver PMTU with GRO packets, add missing call to
skb_gso_validate_network_len() to handle GRO packets.
====================
Anjali Kulkarni [Fri, 20 Oct 2023 23:40:58 +0000 (16:40 -0700)]
Fix NULL pointer dereference in cn_filter()
Check that sk_user_data is not NULL, else return from cn_filter().
Could not reproduce this issue, but Oliver Sang verified it has fixed
the "Closes" problem below.
Linus Torvalds [Tue, 24 Oct 2023 06:40:04 +0000 (20:40 -1000)]
Merge tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull nfsd fix from Al Viro:
"Catch from lock_rename() audit; nfsd_rename() checked that both
directories belonged to the same filesystem, but only after having
done lock_rename().
Trivial fix, tested and acked by nfs folks"
* tag 'pull-nfsd-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nfsd: lock_rename() needs both directories to live on the same fs
Linus Torvalds [Tue, 24 Oct 2023 00:19:11 +0000 (14:19 -1000)]
Merge tag 'urgent/nolibc.2023.10.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull nolibc fixes from Paul McKenney:
- tools/nolibc: i386: Fix a stack misalign bug on _start
- MAINTAINERS: nolibc: update tree location
- tools/nolibc: mark start_c as weak to avoid linker errors
* tag 'urgent/nolibc.2023.10.16a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
tools/nolibc: mark start_c as weak
MAINTAINERS: nolibc: update tree location
tools/nolibc: i386: Fix a stack misalign bug on _start
Reduce the length of netlink error messages as they are likely to be
truncated anyway. Additionally, reword netlink error messages so they
are more consistent with previous messages.
Arnd Bergmann [Mon, 23 Oct 2023 19:01:45 +0000 (21:01 +0200)]
Merge tag 'mvebu-fixes-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu into arm/fixes
mvebu fixes for 6.6 (part 1)
Update MAINTAINERS for eDPU board
* tag 'mvebu-fixes-6.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu:
MAINTAINERS: uDPU: add remaining Methode boards
MAINTAINERS: uDPU: make myself maintainer of it
Linus Torvalds [Mon, 23 Oct 2023 17:59:13 +0000 (07:59 -1000)]
Merge tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix for a problem with snapshot of a newly created subvolume
that can lead to inconsistent data under some circumstances. Kernel
6.5 added a performance optimization to skip transaction commit for
subvolume creation but this could end up with newer data on disk but
not linked to other structures.
The fix itself is an added condition, the rest of the patch is a
parameter added to several functions"
* tag 'for-6.6-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix unwritten extent buffer after snapshotting a new subvolume
Linus Torvalds [Mon, 23 Oct 2023 17:42:48 +0000 (07:42 -1000)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio fixes from Michael Tsirkin:
"A collection of small fixes that look like worth having in this
release"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio_pci: fix the common cfg map size
virtio-crypto: handle config changed by work queue
vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE
vdpa/mlx5: Fix firmware error on creation of 1k VQs
virtio_balloon: Fix endless deflation and inflation on arm64
vdpa/mlx5: Fix double release of debugfs entry
virtio-mmio: fix memory leak of vm_dev
vdpa_sim_blk: Fix the potential leak of mgmt_dev
tools/virtio: Add dma sync api for virtio test
net/handshake: fix file ref count in handshake_nl_accept_doit()
If req->hr_proto->hp_accept() fails, we call fput() twice:
Once in the error path, but also a second time because sock->file
is at that point already associated with the file descriptor. Once
the task exits, as it would probably do after receiving an error
reading from netlink, the fd is closed, calling fput() a second time.
To fix, we move installing the file after the error path for the
hp_accept() call. In the case of errors we simply put the unused fd.
In case of success we can use fd_install() to link the sock->file
to the reserved fd.
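The reordering can be sketched with a toy file and descriptor model. The toy_* names are hypothetical; they only mimic the roles of the reserved fd, fput() and fd_install() described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_file {
	int refcount;
};

struct toy_fd {
	bool reserved;          /* fd number handed out, nothing attached */
	struct toy_file *file;  /* non-NULL once the file is installed */
};

static void toy_fput(struct toy_file *f)
{
	assert(f->refcount > 0);  /* a second put would trip this */
	f->refcount--;
}

/* Models task exit: closing an installed fd drops its file reference. */
static void toy_close(struct toy_fd *fd)
{
	if (fd->file)
		toy_fput(fd->file);
	fd->file = NULL;
	fd->reserved = false;
}

/* Fixed ordering: install only after the accept callback has succeeded. */
static int toy_accept_doit(struct toy_fd *fd, struct toy_file *f, int accept_err)
{
	fd->reserved = true;           /* reserve the descriptor first */

	if (accept_err) {
		fd->reserved = false;  /* give the unused fd back */
		toy_fput(f);           /* the only put on the error path */
		return accept_err;
	}

	fd->file = f;                  /* link file and fd on success */
	return 0;
}
```

Because the file is never attached to the descriptor on the error path, the later close in this model has nothing to put, so the reference is dropped exactly once.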
Filipe Manana [Thu, 19 Oct 2023 12:19:28 +0000 (13:19 +0100)]
btrfs: fix unwritten extent buffer after snapshotting a new subvolume
When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:
1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
address 36007936, for the leaf/root of a new subvolume that has an ID
of 291. We mark the extent buffer as dirty, and at this point the
subvolume tree has a single node/leaf which is also its root (level 0);
2) We no longer commit the transaction used to create the subvolume at
create_subvol(). We used to, but that was recently removed in
commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol
create");
3) The transaction used to create the subvolume has an ID of 33, so the
extent buffer 36007936 has a generation of 33;
4) Several updates happen to subvolume 291 during transaction 33: several
files are created and its tree height changes from 0 to 1, so we end up with
a new root at level 1 and the extent buffer 36007936 is now a leaf of
that new root node, which is extent buffer 36048896.
The commit root remains as 36007936, since we are still at transaction
33;
5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
we end up at transaction.c:create_pending_snapshot(), in the critical
section of a transaction commit.
There we COW the root of subvolume 291, which is extent buffer 36048896.
The COW operation returns extent buffer 36048896, since there's no need
to COW because the extent buffer was created in this transaction and it
was not written yet.
Then we call btrfs_copy_root() against the root node 36048896. During
this operation we allocate a new extent buffer to turn into the root
node of the snapshot, copy the contents of the root node 36048896 into
this snapshot root extent buffer, set the owner to 292 (the ID of the
snapshot), etc, and then we call btrfs_inc_ref(). This will create a
delayed reference for each leaf pointed to by the root node with a
reference root of 292 - this includes a reference for the leaf 36007936.
After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.
Then we call btrfs_insert_dir_item(), to create the directory entry in
the tree of subvolume 291 that points to the snapshot. This ends up
needing to modify leaf 36007936 to insert the respective directory
items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
we need to COW the leaf. We end up at btrfs_force_cow_block() and then
at update_ref_for_cow().
At update_ref_for_cow() we call btrfs_block_can_be_shared() which
returns false, despite the fact the leaf 36007936 is shared - the
subvolume's root and the snapshot's root point to that leaf. The
reason that it incorrectly returns false is because the commit root
of the subvolume is extent buffer 36007936 - it was the initial root
of the subvolume when we created it. So btrfs_block_can_be_shared()
which has the following logic:
Returns false (0) since 'buf' (extent buffer 36007936) matches the
root's commit root.
As a result, at update_ref_for_cow(), we don't check for the number
of references for extent buffer 36007936, we just assume it's not
shared and therefore that it has only 1 reference, so we set the local
variable 'refs' to 1.
Later on, in the final if-else statement at update_ref_for_cow():
So we mark the extent buffer 36007936 as not dirty, and as a result
we don't write it to disk later in the transaction commit, despite the
fact that the snapshot's root points to it.
Attempting to access the leaf or dumping the tree for example shows
that the extent buffer was not written:
$ btrfs inspect-internal dump-tree -t 292 /dev/sdb
btrfs-progs v6.2.2
file tree key (292 ROOT_ITEM 33)
node 36110336 level 1 items 2 free space 119 generation 33 owner 292
node 36110336 flags 0x1(WRITTEN) backref revision 1
checksum stored a8103e3e
checksum calced a8103e3e
fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
key (256 INODE_ITEM 0) block 36007936 gen 33
key (257 EXTENT_DATA 0) block 36052992 gen 33
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
total bytes 107374182400
bytes used 38572032
uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
The respective on disk region is full of zeroes as the device was
trimmed at mkfs time.
Obviously 'btrfs check' also detects and complains about this:
$ btrfs check /dev/sdb
Opening filesystem to check...
Checking filesystem on /dev/sdb
UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
generation: 33 (33)
[1/7] checking root items
[2/7] checking extents
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
owner ref check failed [36007936 4096]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
The following tree block(s) is corrupted in tree 292:
tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
root 292 root dir 256 not found
ERROR: errors found in fs roots
found 38572032 bytes used, error(s) found
total csum bytes: 16048
total tree bytes: 1265664
total fs tree bytes: 1118208
total extent tree bytes: 65536
btree space waste bytes: 562598
file data blocks allocated: 65978368
referenced 36569088
Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.
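A simplified model of the before/after decision, using the block numbers from the walkthrough. The toy_* types are hypothetical, and every other condition the real function evaluates is deliberately left out, so this is a sketch of the commit-root case only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_buf {
	unsigned long long bytenr;
	unsigned long long generation;
};

struct toy_root {
	const struct toy_buf *node;         /* current root node */
	const struct toy_buf *commit_root;  /* root as of the last commit */
};

/* Behaviour described for the old code: a commit-root match means "no". */
static bool toy_can_be_shared_old(const struct toy_root *root,
				  const struct toy_buf *buf)
{
	assert(root != NULL && buf != NULL);
	if (buf == root->node || buf == root->commit_root)
		return false;
	return true;  /* remaining conditions omitted in this sketch */
}

/* Described fix: a commit root created in this transaction may be shared. */
static bool toy_can_be_shared_new(const struct toy_root *root,
				  const struct toy_buf *buf,
				  unsigned long long transid)
{
	assert(root != NULL && buf != NULL);
	if (buf == root->node)
		return false;
	if (buf == root->commit_root)
		return buf->generation == transid;
	return true;  /* remaining conditions omitted in this sketch */
}
```

For leaf 36007936 (generation 33, still the commit root) inside transaction 33, the old model answers "not shared" while the new one answers "may be shared", which is what forces the reference count to be checked.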
This can be reproduced with the following script:
$ cat test.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
# Use a filesystem with a 64K node size so that we have the same node
# size on every machine regardless of its page size (on x86_64 default
# node size is 16K due to the 4K page size, while on PPC it's 64K by
# default). This way we can make sure we are able to create a btree for
# the subvolume with a height of 2.
mkfs.btrfs -f -n 64K $DEV
mount $DEV $MNT
btrfs subvolume create $MNT/subvol
# Create a few empty files on the subvolume, this bumps its btree
# height to 2 (root node at level 1 and 2 leaves).
for ((i = 1; i <= 300; i++)); do
    echo -n > $MNT/subvol/file_$i
done
btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap
umount $DEV
btrfs check $DEV
Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):
$ ./test.sh
Create subvolume '/mnt/sdi/subvol'
Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
Opening filesystem to check...
Checking filesystem on /dev/sdi
UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
owner ref check failed [30539776 65536]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
Wrong generation of child node/leaf, wanted: 5, have: 7
root 257 root dir 256 not found
ERROR: errors found in fs roots
found 917504 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 851968
total fs tree bytes: 393216
total extent tree bytes: 65536
btree space waste bytes: 736550
file data blocks allocated: 0
referenced 0
A bisect pointed to the breakage beginning with commit 9fee28baa601 ("powerpc:
implement the new page table range API").
Analysis of the oops pointed to a struct page with a corrupted
compound_head being loaded via page_folio() -> _compound_head() in
hash_page_do_lazy_icache().
The access by the mpic code is to an MMIO address, so the expectation
is that the struct page for that address would be initialised by
init_unavailable_range(), as pointed out by Aneesh.
Instrumentation showed that was not the case, which eventually led to
the realisation that pfn_valid() was returning false for that address,
causing the struct page to not be initialised.
Because the system is using FLATMEM, the version of pfn_valid() in
memory_model.h is used:
    static inline int pfn_valid(unsigned long pfn)
    {
            ...
            return pfn >= pfn_offset && (pfn - pfn_offset) < max_mapnr;
    }
Which relies on max_mapnr being initialised. Early in boot max_mapnr is
zero meaning no PFNs are valid.
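The effect of an uninitialised max_mapnr can be checked with a standalone copy of that comparison. The toy_* globals and the setter below are hypothetical stand-ins for the kernel variables, not the kernel's own code.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel globals; both start at zero, as early in boot. */
static unsigned long toy_pfn_offset;
static unsigned long toy_max_mapnr;

/* Hypothetical setter modelling the point where max_mapnr becomes known. */
static void toy_set_max_mapnr(unsigned long nr)
{
	assert(nr > 0);
	toy_max_mapnr = nr;
}

/* Same comparison as the FLATMEM helper quoted above. */
static bool toy_pfn_valid(unsigned long pfn)
{
	return pfn >= toy_pfn_offset && (pfn - toy_pfn_offset) < toy_max_mapnr;
}
```

While toy_max_mapnr is still zero, the second half of the comparison can never be true, so every PFN is rejected. Once the setter has run, PFNs below the limit are accepted.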
max_mapnr is initialised in mem_init() called via:
Although max_mapnr is currently set in mem_init(), the value is actually
already available much earlier, as soon as mem_topology_setup() has
completed, which is also before paging_init() is called. So move the
initialisation there, which causes paging_init() to correctly initialise
the struct page and fixes the bug.
This bug seems to have been lurking for years, but went unnoticed
because the pre-folio code was inspecting the uninitialised page->flags
but not dereferencing it.
Johannes Berg [Wed, 18 Oct 2023 09:42:51 +0000 (11:42 +0200)]
wifi: cfg80211: fix assoc response warning on failed links
The warning here shouldn't be done before we even set the
bss field (or should've used the input data). Move the
assignment before the warning to fix it.
We noticed this now because of Wen's bugfix, where the bug
fixed there had previously hidden this other bug.
Fixes: 53ad07e9823b ("wifi: cfg80211: support reporting failed links")
Signed-off-by: Johannes Berg <[email protected]>
Linus Torvalds [Sun, 22 Oct 2023 17:05:28 +0000 (07:05 -1000)]
Merge tag 'efi-fixes-for-v6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fixes from Ard Biesheuvel:
"The boot_params pointer fix uses a somewhat ugly extern struct
declaration but this will be cleaned up the next cycle.
- don't try to print warnings to the console when it is no longer
available
- fix theoretical memory leak in SSDT override handling
- make sure that the boot_params global variable is set before the
KASLR code attempts to hash it for 'randomness'
- avoid soft lockups in the memory acceptance code"
* tag 'efi-fixes-for-v6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi/unaccepted: Fix soft lockups caused by parallel memory acceptance
x86/boot: efistub: Assign global boot_params variable
efi: fix memory leak in krealloc failure handling
x86/efistub: Don't try to print after ExitBootService()
Fred Chen [Sat, 21 Oct 2023 00:19:47 +0000 (08:19 +0800)]
tcp: fix wrong RTO timeout when received SACK reneging
This commit fixes the wrong RTO timeout when SACK reneging is received.
When an ACK arrives pointing to a SACK reneging, tcp_check_sack_reneging()
will rearm the RTO timer for max(1/2*srtt, 10ms) into the future.
But since commit 62d9f1a6945b ("tcp: fix TLP timer not set when
CA_STATE changes from DISORDER to OPEN") was merged, tcp_set_xmit_timer()
is called after tcp_fastretrans_alert() (which does the SACK reneging check),
so the RTO timeout will be overwritten by tcp_set_xmit_timer() with
icsk_rto instead of 1/2*srtt.
Here is a packetdrill script to check this bug:
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
// we expect rto fired in 1/2*srtt (50ms)
+.05 > . 1001:2001(1000) ack 1
This fix removes FLAG_SET_XMIT_TIMER from ack_flag when
tcp_check_sack_reneging() sets the RTO timer to 1/2*srtt, to avoid
it being overwritten later.
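The interaction between the two timer updates can be sketched as follows. This is a toy model under stated assumptions: the toy_* names and the flag value are hypothetical, times are in milliseconds, and the short delay is taken as the larger of srtt/2 and 10 ms, which is consistent with the 50 ms the packetdrill comment expects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FLAG_SET_XMIT_TIMER 0x1000u  /* hypothetical value */

struct toy_sock {
	unsigned int srtt_ms;   /* smoothed RTT */
	unsigned int rto_ms;    /* regular RTO (icsk_rto in the text) */
	unsigned int timer_ms;  /* currently armed retransmit timeout */
};

static unsigned int toy_max(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* Arms the short timer and tells the caller not to rearm it afterwards. */
static bool toy_check_sack_reneging(struct toy_sock *sk, bool reneging,
				    unsigned int *ack_flag)
{
	assert(sk != NULL && ack_flag != NULL);
	if (!reneging)
		return false;

	sk->timer_ms = toy_max(sk->srtt_ms / 2, 10);
	*ack_flag &= ~TOY_FLAG_SET_XMIT_TIMER;  /* the step this fix adds */
	return true;
}

/* ACK processing: the reneging check runs before the generic timer update. */
static void toy_ack(struct toy_sock *sk, bool reneging)
{
	unsigned int flag = TOY_FLAG_SET_XMIT_TIMER;

	toy_check_sack_reneging(sk, reneging, &flag);
	if (flag & TOY_FLAG_SET_XMIT_TIMER)
		sk->timer_ms = sk->rto_ms;      /* generic rearm with the RTO */
}
```

Without the flag being cleared, the generic rearm at the end of toy_ack() would replace the short timeout with the full RTO, which is the overwrite this commit describes.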
Fixes: 62d9f1a6945b ("tcp: fix TLP timer not set when CA_STATE changes from DISORDER to OPEN")
Signed-off-by: Fred Chen <[email protected]>
Reviewed-by: Neal Cardwell <[email protected]>
Tested-by: Neal Cardwell <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Xiang Chen [Thu, 19 Oct 2023 13:01:21 +0000 (21:01 +0800)]
ACPI: NFIT: Install Notify() handler before getting NFIT table
If there is no NFIT at startup, acpi_nfit_add() returns 0 immediately
and does not install the Notify() handler. If an nvdimm device is
hotplugged later, it will not be identified because there is no Notify()
handler.
Install the handler before getting the NFIT table in acpi_nfit_add()
to avoid the above issue.
Fixes: dcca12ab62a2 ("ACPI: NFIT: Install Notify() handler directly")
Signed-off-by: Xiang Chen <[email protected]>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <[email protected]>
David S. Miller [Sun, 22 Oct 2023 10:46:18 +0000 (11:46 +0100)]
Merge branch 'r8152-reg-garbage'
Douglas Anderson says:
====================
r8152: Avoid writing garbage to the adapter's registers
This series is the result of a cooperative debug effort between
Realtek and the ChromeOS team. On ChromeOS, we've noticed that Realtek
Ethernet adapters can sometimes get so wedged that even a reboot of
the host can't get them to enumerate again, assuming that the adapter
was on a powered hub and didn't lose power when the host rebooted. This
is sometimes seen in the ChromeOS automated testing lab. The only way
to recover adapters in this state is to manually power cycle them.
I managed to reproduce one instance of this wedging (unknown if this
is truly related to what the test lab sees) by doing this:
1. Start a flood ping from a host to the device.
2. Drop the device into kdb.
3. Wait 90 seconds.
4. Resume from kdb (the "g" command).
5. Wait another 45 seconds.
Upon analysis, Realtek realized this was happening:
1. The Linux driver was getting a "Tx timeout" after resuming from kdb
and then trying to reset itself.
2. As part of the reset, the Linux driver was attempting to do a
read-modify-write of the adapter's registers.
3. The read would fail (due to a timeout) and the driver pretended
that the register contained all 0xFFs. See commit f53a7ad18959
("r8152: Set memory to all 0xFFs on failed reg reads")
4. The driver would take this value of all 0xFFs, modify it, and
attempt to write it back to the adapter.
5. By this time the USB channel seemed to recover and thus we'd
successfully write a value that was mostly 0xFFs to the adapter.
6. The adapter didn't like this and would wedge itself.
Another engineer also managed to reproduce wedging of the Realtek
Ethernet adapter during a reboot test on an AMD Chromebook. In that
case he was sometimes seeing -EPIPE returned from the control
transfers.
This patch series fixes both issues.
Changes in v5:
- ("Run the unload routine if we have errors during probe") new for v5.
- ("Cancel hw_phy_work if we have an error in probe") new for v5.
- ("Release firmware if we have an error in probe") new for v5.
- Removed extra mutex_unlock() left over in v4.
- Fixed minor typos.
- Don't queue an unbind/bind reset if probe fails; just retry probe.
Changes in v4:
- Took out some unnecessary locks/unlocks of the control mutex.
- Added comment about reading version causing probe fail if 3 fails.
- Added text to commit msg about the potential unbind/bind loop.
Changes in v3:
- Fixed v2 changelog ending up in the commit message.
- farmework -> framework in comments.
Changes in v2:
- ("Check for unplug in rtl_phy_patch_request()") new for v2.
- ("Check for unplug in r8153b_ups_en() / r8153c_ups_en()") new for v2.
- ("Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE") new for v2.
- Reset patch no longer based on retry patch, since that was dropped.
- Reset patch should be robust even if failures happen in probe.
- Switched booleans to bits in the "flags" variable.
- Check for -ENODEV instead of "udev->state == USB_STATE_NOTATTACHED"
====================
Douglas Anderson [Fri, 20 Oct 2023 21:06:59 +0000 (14:06 -0700)]
r8152: Block future register access if register access fails
Even though the functions to read/write registers can fail, most of
the places in the r8152 driver that read/write register values don't
check error codes. The lack of error code checking is problematic in
at least two ways.
The first problem is that the r8152 driver often uses code patterns
similar to this:
    x = read_register();
    x = x | SOME_BIT;
    write_register(x);
...with the above pattern, if the read_register() fails and returns
garbage then we'll end up trying to write modified garbage back to the
Realtek adapter. If the write_register() succeeds that's bad. Note
that as of commit f53a7ad18959 ("r8152: Set memory to all 0xFFs on
failed reg reads") the "garbage" returned by read_register() will at
least be consistent garbage, but it is still garbage.
It turns out that this problem is very serious. Writing garbage to
some of the hardware registers on the Ethernet adapter can put the
adapter in such a bad state that it needs to be power cycled (fully
unplugged and plugged in again) before it can enumerate again.
The second problem is that the r8152 driver generally has functions
that are long sequences of register writes. Assuming everything will
be OK if a random register write fails in the middle isn't a great
assumption.
One might wonder if the above two problems are real. You could ask if
we would really have a successful write after a failed read. It turns
out that the answer appears to be "yes, this can happen". In fact,
we've seen at least two distinct failure modes where this happens.
On a sc7180-trogdor Chromebook if you drop into kdb for a while and
then resume, you can see:
1. We get a "Tx timeout"
2. The "Tx timeout" queues up a USB reset.
3. In rtl8152_pre_reset() we try to reinit the hardware.
4. The first several (2-9) register accesses fail with a timeout, then
things recover.
The above test case was actually fixed by the patch ("r8152: Increase
USB control msg timeout to 5000ms as per spec") but at least shows
that we really can see successful calls after failed ones.
On a different (AMD-based) Chromebook with a particular adapter, we
found that during reboot tests we'd also sometimes get a transitory
failure. In this case we saw -EPIPE being returned sometimes. Retrying
worked, but retrying is not always safe for all register accesses
since reading/writing some registers might have side effects (like
registers that clear on read).
Let's fully lock out all register access if a register access fails.
When we do this, we'll try to queue up a USB reset and try to unlock
register access after the reset. This is slightly trickier than it
sounds since the r8152 driver has an optimized reset sequence that
only works reliably after probe happens. In order to handle this, we
avoid the optimized reset if probe didn't finish. Instead, we simply
retry the probe routine in this case.
When locking out access, we'll use the existing infrastructure that
the driver was using when it detected we were unplugged. This keeps us
from getting stuck in delay loops in some parts of the driver.
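The lockout described above can be sketched in plain userspace C. This is only an illustration of the idea, not the driver's code: struct fake_dev, fake_transfer(), fake_write_reg() and fake_reset_done() are hypothetical names, and the "hardware" is simulated with a counter of transfers that will fail.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the driver state; all names are illustrative. */
struct fake_dev {
	bool inaccessible;  /* plays the role of the "inaccessible" flag */
	bool reset_queued;  /* a reset has been requested */
	int fail_next;      /* number of upcoming transfers that will fail */
	int hw_writes;      /* writes that actually reached the "hardware" */
};

/* Simulated control transfer: fails while fail_next is positive. */
static int fake_transfer(struct fake_dev *d)
{
	if (d->fail_next > 0) {
		d->fail_next--;
		return -1;
	}
	return 0;
}

/* Register write that locks out all later accesses after one failure. */
static int fake_write_reg(struct fake_dev *d, unsigned int reg,
			  unsigned int val)
{
	(void)reg;
	(void)val;

	if (d->inaccessible)
		return -1;	/* locked out: do not touch the hardware */

	if (fake_transfer(d) < 0) {
		d->inaccessible = true;	/* block every later access */
		d->reset_queued = true;	/* ask for a reset to recover */
		return -1;
	}

	d->hw_writes++;
	return 0;
}

/* Called once the (simulated) reset has finished: unlock access again. */
static void fake_reset_done(struct fake_dev *d)
{
	d->reset_queued = false;
	d->inaccessible = false;
}
```

The point of the sketch is that a write issued after a failed access never reaches the hardware, even if the transfer itself would have succeeded, until the reset completes.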
Douglas Anderson [Fri, 20 Oct 2023 21:06:58 +0000 (14:06 -0700)]
r8152: Rename RTL8152_UNPLUG to RTL8152_INACCESSIBLE
Whenever the RTL8152_UNPLUG flag is set, that just tells the driver that all
accesses will fail and we should just immediately bail. A future patch
will use this same concept at a time when the driver hasn't actually
been unplugged but is about to be reset. Rename the flag in
preparation for the future patch.
This is a no-op change and just a search and replace.
Douglas Anderson [Fri, 20 Oct 2023 21:06:57 +0000 (14:06 -0700)]
r8152: Check for unplug in r8153b_ups_en() / r8153c_ups_en()
If the adapter is unplugged while we're looping in r8153b_ups_en() /
r8153c_ups_en() we could end up looping for 10 seconds (20 ms * 500
loops). Add code similar to what's done in other places in the driver
to check for unplug and bail.
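A minimal sketch of such a bounded polling loop with an early bail-out follows. It is a userspace simulation under stated assumptions: struct poll_ctx, poll_ready() and wait_ready() are hypothetical names, the device state is faked with counters, and the sleep between polls is omitted.

```c
#include <stdbool.h>

/* Hypothetical simulated device; all names are illustrative. */
struct poll_ctx {
	bool inaccessible;  /* set once the device is "unplugged" */
	int ready_after;    /* poll count at which it reports ready, <0 = never */
	int unplug_after;   /* poll count at which it is unplugged, <0 = never */
	int polls;          /* number of polls performed so far */
};

/* One simulated status read; may also simulate an unplug event. */
static bool poll_ready(struct poll_ctx *c)
{
	c->polls++;
	if (c->unplug_after >= 0 && c->polls >= c->unplug_after)
		c->inaccessible = true;
	return !c->inaccessible && c->ready_after >= 0 &&
	       c->polls >= c->ready_after;
}

/*
 * Poll up to max_loops times. Returns 0 when ready, -1 when the device
 * went away (bail immediately), -2 on timeout.
 */
static int wait_ready(struct poll_ctx *c, int max_loops)
{
	for (int i = 0; i < max_loops; i++) {
		if (c->inaccessible)
			return -1;	/* no point in waiting any longer */
		if (poll_ready(c))
			return 0;
		/* a real driver would sleep here between polls */
	}
	return -2;
}
```

Without the check at the top of the loop, the unplugged case would run all max_loops iterations before giving up.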
Douglas Anderson [Fri, 20 Oct 2023 21:06:56 +0000 (14:06 -0700)]
r8152: Check for unplug in rtl_phy_patch_request()
If the adapter is unplugged while we're looping in
rtl_phy_patch_request() we could end up looping for 10 seconds (2 ms *
5000 loops). Add code similar to what's done in other places in the
driver to check for unplug and bail.
Douglas Anderson [Fri, 20 Oct 2023 21:06:55 +0000 (14:06 -0700)]
r8152: Release firmware if we have an error in probe
The error handling in rtl8152_probe() is missing a call to release
firmware. Add it in to match what's in the cleanup code in
rtl8152_disconnect().
Fixes: 9370f2d05a2a ("r8152: support request_firmware for RTL8153")
Signed-off-by: Douglas Anderson <[email protected]>
Reviewed-by: Grant Grundler <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
Douglas Anderson [Fri, 20 Oct 2023 21:06:54 +0000 (14:06 -0700)]
r8152: Cancel hw_phy_work if we have an error in probe
The error handling in rtl8152_probe() is missing a call to cancel the
hw_phy_work. Add it in to match what's in the cleanup code in
rtl8152_disconnect().
Fixes: a028a9e003f2 ("r8152: move the settings of PHY to a work queue")
Signed-off-by: Douglas Anderson <[email protected]>
Reviewed-by: Grant Grundler <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
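The two probe fixes above share one idea: the error path in probe should undo the same things the disconnect path undoes. A simplified userspace sketch of that pattern, assuming a hypothetical struct probe_state and helper names that do not come from the driver:

```c
#include <stdbool.h>

/* Hypothetical resources acquired during probe; names are illustrative. */
struct probe_state {
	bool fw_loaded;         /* firmware has been requested */
	bool phy_work_pending;  /* deferred PHY setup work is queued */
	bool registered;        /* device registration succeeded */
};

static void release_fw(struct probe_state *s)
{
	s->fw_loaded = false;
}

static void cancel_phy_work(struct probe_state *s)
{
	s->phy_work_pending = false;
}

/* Probe that cleans up everything it acquired when a later step fails. */
static int fake_probe(struct probe_state *s, bool fail_register)
{
	s->fw_loaded = true;
	s->phy_work_pending = true;

	if (fail_register)
		goto out;

	s->registered = true;
	return 0;

out:
	/* mirror the cleanup done in fake_disconnect() */
	cancel_phy_work(s);
	release_fw(s);
	return -1;
}

/* Normal teardown; the probe error path must release the same things. */
static void fake_disconnect(struct probe_state *s)
{
	s->registered = false;
	cancel_phy_work(s);
	release_fw(s);
}
```

Keeping the error label's cleanup in step with the disconnect routine makes a missing release or cancel easier to spot in review.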