Darrick J. Wong [Mon, 15 Apr 2024 21:54:50 +0000 (14:54 -0700)]
xfs: inactivate directory data blocks
Teach inode inactivation to delete all the incore buffers backing a
directory. In normal runtime this should never happen because the VFS
forbids rmdir on a non-empty directory.
In the next patch, online directory repair stands up a new directory,
exchanges it with the broken directory, and then drops the private
temporary directory. If we cancel the repair just prior to exchanging
the directory contents, the new directory will need to be torn down.
Note: If we commit the repair, reaping will take care of all the ondisk
space allocations and incore buffers for the old corrupt directory.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:49 +0000 (14:54 -0700)]
xfs: update the unlinked list when repairing link counts
When we're repairing the link counts of a file, we must ensure either
that the file has zero link count and is on the unlinked list; or that
it has nonzero link count and is not on the unlinked list.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:49 +0000 (14:54 -0700)]
xfs: ensure unlinked list state is consistent with nlink during scrub
Now that we have the means to tell if an inode is on an unlinked inode
list or not, we can check that an inode with zero link count is on the
unlinked list; and an inode that has nonzero link count is not on that
list. Make repair clean things up too.
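The invariant being checked can be written as a one-line predicate. This is a minimal sketch only: `struct demo_inode` and `demo_nlink_state_ok` are hypothetical names invented for illustration, not the real XFS inode or scrub code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an inode; not the real xfs_inode. */
struct demo_inode {
	uint32_t nlink;        /* link count */
	bool     on_unlinked;  /* true if the inode sits on an unlinked list */
};

/*
 * Return true when link count and unlinked-list membership agree:
 * nlink == 0 must imply "on the list", nlink > 0 must imply "off the list".
 */
static bool demo_nlink_state_ok(const struct demo_inode *ip)
{
	return (ip->nlink == 0) == ip->on_unlinked;
}
```

A checker would flag any inode for which the predicate is false, and a repair would either add the inode to the list or remove it.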
Darrick J. Wong [Mon, 15 Apr 2024 21:54:46 +0000 (14:54 -0700)]
xfs: scrub should set preen if attr leaf has holes
If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:45 +0000 (14:54 -0700)]
xfs: repair extended attributes
If the extended attributes look bad, try to sift through the rubble to
find whatever keys/values we can, stage a new attribute structure in a
temporary file and use the atomic extent swapping mechanism to commit
the results in bulk.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:44 +0000 (14:54 -0700)]
xfs: use atomic extent swapping to fix user file fork data
Build on the code that was recently added to the temporary repair file
code so that we can atomically switch the contents of any file fork,
even if the fork is in local format. The upcoming functions to repair
xattrs, directories, and symlinks will need that capability.
Repair can lock out access to these user files by holding IOLOCK_EXCL on
these user files. Therefore, it is safe to drop the ILOCK of both the
file being repaired and the tempfile being used for staging, and cancel
the scrub transaction. We do this so that we can reuse the resource
estimation and transaction allocation functions used by a regular file
exchange operation.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:43 +0000 (14:54 -0700)]
xfs: create a blob array data structure
Create a simple 'blob array' data structure for storage of arbitrarily
sized metadata objects that will be used to reconstruct metadata. For
the intended usage (temporarily storing extended attribute names and
values) we only have to support storing objects and retrieving them.
Use the xfile abstraction to store the attribute information in memory
that can be swapped out.
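A store-and-retrieve blob array of this kind might look roughly like the sketch below. All names (`demo_blob_array`, `demo_blob_store`, `demo_blob_load`) are hypothetical, and a plain heap buffer stands in for the xfile; the real kernel interface is not reproduced here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; a heap buffer stands in for the swappable xfile. */
struct demo_blob_array {
	unsigned char *buf;
	size_t         used;
	size_t         cap;
};

typedef size_t demo_blob_cookie;	/* byte offset of a stored blob */

/* Append a length header plus payload; return 0 and set *cookie on success. */
static int demo_blob_store(struct demo_blob_array *ba, const void *data,
			   uint32_t len, demo_blob_cookie *cookie)
{
	size_t need = ba->used + sizeof(len) + len;

	if (need > ba->cap) {
		size_t ncap = ba->cap ? ba->cap : 64;
		unsigned char *nbuf;

		while (ncap < need)
			ncap *= 2;
		nbuf = realloc(ba->buf, ncap);
		if (!nbuf)
			return -1;
		ba->buf = nbuf;
		ba->cap = ncap;
	}
	*cookie = ba->used;
	memcpy(ba->buf + ba->used, &len, sizeof(len));
	memcpy(ba->buf + ba->used + sizeof(len), data, len);
	ba->used = need;
	return 0;
}

/* Copy the blob at cookie into out; fail on a bad cookie or a short buffer. */
static int demo_blob_load(const struct demo_blob_array *ba,
			  demo_blob_cookie cookie, void *out, uint32_t outlen,
			  uint32_t *len)
{
	uint32_t stored;

	if (cookie > ba->used || ba->used - cookie < sizeof(stored))
		return -1;
	memcpy(&stored, ba->buf + cookie, sizeof(stored));
	if (stored > outlen || ba->used - cookie - sizeof(stored) < stored)
		return -1;
	memcpy(out, ba->buf + cookie + sizeof(stored), stored);
	*len = stored;
	return 0;
}
```

Because the only operations are append and lookup by cookie, the structure needs no free-space tracking; the caller frees the whole buffer when the repair is done.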
Darrick J. Wong [Mon, 15 Apr 2024 21:54:42 +0000 (14:54 -0700)]
xfs: enable discarding of folios backing an xfile
Create a new xfile function to discard the page cache that's backing
part of an xfile. The next patch will use this to drop parts of an xfile
that aren't needed anymore.
Port the existing directory freespace block header checking function to
accept an owner number instead of an xfs_inode, then update the
callsites to use xfs_da_args.owner when possible.
Port the existing directory block header checking function to accept an
owner number instead of an xfs_inode, then update the callsites to use
xfs_da_args.owner when possible.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:40 +0000 (14:54 -0700)]
xfs: validate explicit directory data buffer owners
Port the existing directory data header checking function to accept an
owner number instead of an xfs_inode, then update the callsites to use
xfs_da_args.owner when possible.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:34 +0000 (14:54 -0700)]
xfs: use the xfs_da_args owner field to set new dir/attr block owner
When we're creating leaf, data, freespace, or dabtree blocks for
directories and xattrs, use the explicit owner field (instead of the
xfs_inode) to set the owner field. This will enable online repair to
construct replacement data structures in a temporary file without having
to change the owner fields prior to swapping the new and old structures.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:34 +0000 (14:54 -0700)]
xfs: add an explicit owner field to xfs_da_args
Add an explicit owner field to xfs_da_args, which will make it easier
for online fsck to set the owner field of the temporary directory and
xattr structures that it builds to repair damaged metadata.
Note: I hopefully found all the xfs_da_args definitions by looking for
automatic stack variable declarations and xfs_da_args.dp assignments:
Darrick J. Wong [Mon, 15 Apr 2024 21:54:32 +0000 (14:54 -0700)]
xfs: teach the tempfile to set up atomic file content exchanges
Create some new routines to exchange the contents of a temporary file
created to stage a repair with another ondisk file. This will be used
by the realtime summary repair function to commit atomically the new
rtsummary data, which will be staged in the tempfile.
The rest of XFS coordinates access to the realtime metadata inodes
solely through the ILOCK. For repair to hold its exclusive access to
the realtime summary file, it has to allocate a single large transaction
and roll it repeatedly throughout the repair while holding the ILOCK.
In turn, this means that for now there's only a partial file mapping
exchange implementation for the temporary file because we can only work
within an existing transaction.
For now, the only tempswap functions needed here are to estimate the
resource requirements of the exchange, reserve more space/quota to an
existing transaction, and kick off the actual exchange. The rest will
be added in a later patch in preparation for repairing xattrs and
directories.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:31 +0000 (14:54 -0700)]
xfs: support preallocating and copying content into temporary files
Create the routines we need to preallocate space in a temporary ondisk
file and then copy the contents of an xfile into the tempfile. The
upcoming rtsummary repair feature will construct the contents of a
realtime summary file in memory, after which it will want to copy all
that into the ondisk temporary file before atomically committing the new
rtsummary contents.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:30 +0000 (14:54 -0700)]
xfs: add the ability to reap entire inode forks
In preparation for supporting repair of indexed file-based metadata
(such as realtime bitmaps, directories, and extended attribute data),
add a function to reap the old blocks after a metadata repair finishes.
IOWs, this is an elaborate bunmapi call that deals with crosslinked
blocks by unmapping them without freeing them, and also scans for incore
buffers to invalidate.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:29 +0000 (14:54 -0700)]
xfs: refactor live buffer invalidation for repairs
In an upcoming patch, we will need to be able to look for xfs_buf
objects caching file-based metadata blocks without needing to walk the
(possibly corrupt) structures to find all the buffers. Repair already
has most of the code needed to scan the buffer cache, so hoist these
utility functions.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:28 +0000 (14:54 -0700)]
xfs: create temporary files and directories for online repair
Teach the online repair code how to create temporary files or
directories. These temporary files can be used to stage reconstructed
information until we're ready to perform an atomic extent swap to commit
the new metadata.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:27 +0000 (14:54 -0700)]
xfs: hide private inodes from bulkstat and handle functions
We're about to start adding functionality that uses internal inodes that
are private to XFS. What this means is that userspace should never be
able to access any information about these files, and should not be able
to open these files by handle.
To prevent users from ever finding the file, and to avoid mis-interactions
with the security apparatus, set S_PRIVATE on the inode. Don't allow bulkstat,
open-by-handle, or linking of S_PRIVATE files into the directory tree.
This should keep private inodes actually private.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:26 +0000 (14:54 -0700)]
xfs: enable logged file mapping exchange feature
Add the XFS_SB_FEAT_INCOMPAT_EXCHRANGE feature to the set of features
that we will permit when mounting a filesystem. This turns on support
for the file range exchange feature.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:24 +0000 (14:54 -0700)]
xfs: capture inode generation numbers in the ondisk exchmaps log item
Per some very late review comments, capture the generation numbers of
both inodes involved in a file content exchange operation so that we
don't accidentally target files that have been reallocated.
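The idea of pairing an inode number with its generation can be sketched as follows. The struct and function names are hypothetical and the real ondisk log item layout is not shown; this only illustrates why the generation is recorded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record; the real ondisk log item layout is not shown here. */
struct demo_inode_ref {
	uint64_t ino;	/* inode number captured when the item was logged */
	uint32_t gen;	/* generation number captured at the same time */
};

/*
 * A recorded reference still names the same file only if both the inode
 * number and the generation match what is found later.
 */
static bool demo_ref_matches(const struct demo_inode_ref *ref,
			     uint64_t cur_ino, uint32_t cur_gen)
{
	return ref->ino == cur_ino && ref->gen == cur_gen;
}
```

If an inode number is freed and reallocated to a new file, its generation changes, so a stale reference fails the comparison instead of silently targeting the new file.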
Darrick J. Wong [Mon, 15 Apr 2024 21:54:23 +0000 (14:54 -0700)]
xfs: support non-power-of-two rtextsize with exchange-range
The generic exchange-range alignment checks use (fast) bitmasking
operations to perform block alignment checks on the exchange parameters.
Unfortunately, bitmasks require that the alignment size be a power of
two. This isn't true for realtime devices with a non-power-of-two
extent size, so we have to copy-pasta the generic checks using long
division for this to work properly.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:22 +0000 (14:54 -0700)]
xfs: make file range exchange support realtime files
Now that bmap items support the realtime device, we can add the
necessary pieces to the file range exchange code to support exchanging
mappings. All we really need to do here is adjust the blockcount
upwards to the end of the rt extent and remove the inode checks.
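Rounding a block count up so that the range ends on an rt extent boundary could be sketched like this. The helper name and its parameters are assumptions made for illustration; the actual kernel code may compute this differently.

```c
#include <stdint.h>

/*
 * Extend count so that off + count ends on a multiple of extsize.
 * Hypothetical helper; extsize must be nonzero and need not be a power
 * of two.
 */
static uint64_t demo_round_count_to_extent(uint64_t off, uint64_t count,
					   uint32_t extsize)
{
	uint64_t end = off + count;
	uint64_t rem = end % extsize;

	if (rem)
		end += extsize - rem;
	return end - off;
}
```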
Darrick J. Wong [Mon, 15 Apr 2024 21:54:21 +0000 (14:54 -0700)]
xfs: condense symbolic links after a mapping exchange operation
The previous commit added a new file mapping exchange flag that enables
us to perform post-exchange processing on file2 once we're done
exchanging the extent mappings. Now add this ability for symlinks.
This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online symlink repair feature can
salvage the remote target in a temporary link and exchange the data fork
mappings when ready. If one file is in extents format and the other is
inline, we will have to promote both to extents format to perform the
exchange. After the exchange, we can try to condense the fixed symlink
down to inline format if possible.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:20 +0000 (14:54 -0700)]
xfs: condense directories after a mapping exchange operation
The previous commit added a new file mapping exchange flag that enables
us to perform post-swap processing on file2 once we're done exchanging
extent mappings. Now add this ability for directories.
This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online directory repair feature can
create salvaged dirents in a temporary directory and exchange the data
fork mappings when ready. If one file is in extents format and the
other is inline, we will have to promote both to extents format to
perform the exchange. After the exchange, we can try to condense the
fixed directory down to inline format if possible.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:20 +0000 (14:54 -0700)]
xfs: condense extended attributes after a mapping exchange operation
Add a new file mapping exchange flag that enables us to perform
post-exchange processing on file2 once we're done exchanging the extent
mappings. If we were swapping mappings between extended attribute
forks, we want to be able to convert file2's attr fork from block to
inline format.
(This implies that all fork contents are exchanged.)
This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online xattr repair feature can create
salvaged attrs in a temporary file and exchange the attr fork mappings
when ready. If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the exchange.
After the exchange, we can try to condense the fixed file's attr fork
back down to inline format if possible.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:18 +0000 (14:54 -0700)]
xfs: bind together the front and back ends of the file range exchange code
So far, we've constructed the front end of the file range exchange code
that does all the checking; and the back end of the file mapping
exchange code that actually does the work. Glue these two pieces
together so that we can turn on the functionality.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:17 +0000 (14:54 -0700)]
xfs: create deferred log items for file mapping exchanges
Now that we've created the skeleton of a log intent item to track and
restart file mapping exchange operations, add the upper level logic to
commit intent items and turn them into concrete work recorded in the
log. This builds on the existing bmap update intent items that have
been around for a while now.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:15 +0000 (14:54 -0700)]
xfs: create a incompat flag for atomic file mapping exchanges
Create an incompat flag so that we only attempt to process file mapping
exchange log items if the filesystem supports it, and a geometry flag to
advertise support if it's present.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:14 +0000 (14:54 -0700)]
xfs: introduce new file range exchange ioctl
Introduce a new ioctl to handle exchanging ranges of bytes
between files. The goal here is to perform the exchange atomically with
respect to applications -- either they see the file contents before the
exchange or they see that A-B is now B-A, even if the kernel crashes.
My original goal with all this code was to make it so that online repair
can build a replacement directory or xattr structure in a temporary file
and commit the repair by atomically exchanging all the data blocks
between the two files. However, I needed a way to test this mechanism
thoroughly, so I've been evolving an ioctl interface since then.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:12 +0000 (14:54 -0700)]
xfs: refactor non-power-of-two alignment checks
Create a helper function that can compute if a 64-bit number is an
integer multiple of a 32-bit number, where the 32-bit number is not
required to be an even power of two. This is needed for some new code
for the realtime device, where we can set 37k allocation units and then
have to remap them.
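To see why a bitmask test is not enough here, compare it with a remainder test. This is a minimal sketch with hypothetical function names; it is not the helper added by the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bitmask check: only correct when align is a power of two. */
static bool demo_aligned_pow2(uint64_t x, uint32_t align)
{
	return (x & ((uint64_t)align - 1)) == 0;
}

/* Remainder check: correct for any nonzero align. */
static bool demo_is_multiple(uint64_t x, uint32_t align)
{
	return align != 0 && x % align == 0;
}
```

With a 37k unit (37888 bytes), the bitmask version misjudges in both directions: it rejects 37888 itself and accepts 1024, whereas the remainder version gets both right.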
Darrick J. Wong [Mon, 15 Apr 2024 21:54:10 +0000 (14:54 -0700)]
xfs: create a new helper to return a file's allocation unit
Create a new helper function to calculate the fundamental allocation
unit (i.e. the smallest unit of space we can allocate) of a file.
Things are going to get hairy with range-exchange on the realtime
device, so prepare for this now.
Remove the static attribute from xfs_is_falloc_aligned since the next
patch will need it.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:09 +0000 (14:54 -0700)]
xfs: declare xfs_file.c symbols in xfs_file.h
Move the two public symbols in xfs_file.c to xfs_file.h. We're about to
add more public symbols in that source file, so let's finally create the
header file.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:07 +0000 (14:54 -0700)]
xfs: move inode lease breaking functions to xfs_inode.c
The lease breaking functions operate at the scope of the entire VFS
inode, not subranges of a file. Move them to xfs_inode.c since they're
already declared in xfs_inode.h. This cleanup moves us closer to
having xfs_FOO.h declare only the symbols in xfs_FOO.c.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:06 +0000 (14:54 -0700)]
xfs: only clear log incompat flags at clean unmount
While reviewing the online fsck patchset, someone spied the
xfs_swapext_can_use_without_log_assistance function and wondered why we
go through this inverted-bitmask dance to avoid setting the
XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT feature.
(The same principles apply to the logged extended attribute update
feature bit in the since-merged LARP series.)
The reason for this dance is that xfs_add_incompat_log_feature is an
expensive operation -- it forces the log, pushes the AIL, and then if
nobody's beaten us to it, sets the feature bit and issues a synchronous
write of the primary superblock. That could be a one-time cost
amortized over the life of the filesystem, but the log quiesce and cover
operations call xfs_clear_incompat_log_features to remove feature bits
opportunistically. On a moderately loaded filesystem this leads to us
cycling those bits on and off over and over, which hurts performance.
Why do we clear the log incompat bits? Back in ~2020 I think Dave and I
had a conversation on IRC[2] about what the log incompat bits represent.
IIRC in that conversation we decided that the log incompat bits protect
unrecovered log items so that old kernels won't try to recover them and
barf. Since a clean log has no protected log items, we could clear the
bits at cover/quiesce time.
As Dave Chinner pointed out in the thread, clearing log incompat bits at
unmount time has positive effects for golden root disk image generator
setups, since the generator could be running a newer kernel than what
gets written to the golden image -- if there are log incompat fields set
in the golden image that was generated by a newer kernel/OS image
builder then the provisioning host cannot mount the filesystem even
though the log is clean and recovery is unnecessary to mount the
filesystem.
Given that it's expensive to set log incompat bits, we really only want
to do that once per bit per mount. Therefore, I propose that we only
clear log incompat bits as part of writing a clean unmount record. Do
this by adding an operational state flag to the xfs mount that guards
whether or not the feature bit clearing can actually take place.
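The proposal amounts to gating the clear operation on a state flag, roughly as in the sketch below. The struct, field, and function names are hypothetical, and locking and the superblock write are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, single-threaded model of the mount state. */
struct demo_mount {
	uint32_t log_incompat;		/* log incompat feature bits now set */
	bool     unmounting_clean;	/* stand-in for the opstate flag */
};

/* Returns true only when the expensive "set the bit" work was needed. */
static bool demo_add_log_feature(struct demo_mount *mp, uint32_t bit)
{
	if (mp->log_incompat & bit)
		return false;
	mp->log_incompat |= bit;
	return true;
}

/* Clearing is refused unless we are writing a clean unmount record. */
static bool demo_clear_log_features(struct demo_mount *mp)
{
	if (!mp->unmounting_clean || !mp->log_incompat)
		return false;
	mp->log_incompat = 0;
	return true;
}
```

Under this model a bit is paid for at most once per mount, because nothing clears it until the unmount path raises the flag.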
This eliminates the l_incompat_users rwsem that we use to protect a log
cleaning operation from clearing a feature bit that a frontend thread is
trying to set -- this lock adds another way to fail w.r.t. locking. For
the swapext series, I shard that into multiple locks just to work around
the lockdep complaints, and that's fugly.
Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the
root directory. Thread 20559 holds the root directory ILOCK and is
trying to grab the AGI of an inode that is one of the root directory's
children. The AGI held by 20558 is the same buffer that 20559 is trying
to acquire. In other words, this is an ABBA deadlock.
In general, the lock order is ILOCK and then AGI -- rename does this
while preparing for an operation involving whiteouts or renaming files
out of existence; and unlink does this when moving an inode to the
unlinked list. The only place where we do it in the opposite order is
on the child during an icreate, but at that point the child is marked
INEW and is not visible to other threads.
Work around this deadlock by replacing the blocking ilock attempt with a
nonblocking loop that aborts after 30 seconds. Relax for a jiffy after
a failed lock attempt.
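The workaround pattern, a bounded trylock loop in place of a blocking acquisition, can be sketched in userspace with POSIX mutexes. This is an illustration under assumptions: `demo_trylock_timeout` is a hypothetical name, a millisecond timeout stands in for the 30 seconds, and a 1 ms sleep stands in for a jiffy.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>
#include <time.h>

/*
 * Try to take lock without ever blocking on it. Returns 0 on success or
 * ETIMEDOUT once timeout_ms has elapsed. The caller is assumed to hold
 * some other lock already, which is why a blocking wait could deadlock.
 */
static int demo_trylock_timeout(pthread_mutex_t *lock, long timeout_ms)
{
	const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (;;) {
		int err = pthread_mutex_trylock(lock);
		long elapsed_ms;

		if (err != EBUSY)
			return err;	/* 0 on success, or a real error */

		clock_gettime(CLOCK_MONOTONIC, &now);
		elapsed_ms = (now.tv_sec - start.tv_sec) * 1000L +
			     (now.tv_nsec - start.tv_nsec) / 1000000L;
		if (elapsed_ms >= timeout_ms)
			return ETIMEDOUT;

		nanosleep(&nap, NULL);	/* relax briefly before retrying */
	}
}
```

On timeout the caller would drop the lock it already holds and back out, which breaks the cycle instead of waiting forever.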
Darrick J. Wong [Mon, 15 Apr 2024 21:54:04 +0000 (14:54 -0700)]
xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode
While reviewing the next patch which fixes an ABBA deadlock between the
AGI and a directory ILOCK, someone asked a question about why we're
holding the AGI in the first place. The reason for that is to quiesce
the inode structures for that AG while we do a repair.
I then realized that the xrep_dinode_findmode invokes xchk_iscan_iter,
which walks the inobts (and hence the AGIs) to find all the inodes.
This itself is also an ABBA vector, since the damaged inode could be in
AG 5, which we hold while we scan AG 0 for directories. 5 -> 0 is not
allowed.
To address this, modify the iscan to allow trylock of the AGI buffer
using the flags argument to xfs_ialloc_read_agi that the previous patch
added.
Darrick J. Wong [Mon, 15 Apr 2024 21:54:03 +0000 (14:54 -0700)]
xfs: pass xfs_buf lookup flags to xfs_*read_agi
Allow callers to pass buffer lookup flags to xfs_read_agi and
xfs_ialloc_read_agi. This will be used in the next patch to fix a
deadlock in the online fsck inode scanner.
Merge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull sysfs fix from Al Viro:
"Get rid of lockdep false positives around sysfs/overlayfs
syzbot has uncovered a class of lockdep false positives for setups
with sysfs being one of the backing layers in overlayfs. The root
cause is that of->mutex allocated when opening a sysfs file read-only
(which overlayfs might do) is confused with of->mutex of a file opened
writable (held in write to sysfs file, which overlayfs won't do).
Assigning them separate lockdep classes fixes that bunch and it's
obviously safe"
* tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kernfs: annotate different lockdep class for of->mutex of writable files
Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Follow up fixes for the BHI mitigations code
- Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
expected
- Work around an APIC emulation bug when the kernel is built with Clang
and run as a SEV guest
- Follow up x86 topology fixes
* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Move TOPOEXT enablement into the topology parser
x86/cpu/amd: Make the NODEID_MSR union actually work
x86/cpu/amd: Make the CPUID 0x80000008 parser correct
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
x86/bugs: Fix BHI handling of RRSBA
x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
x86/bugs: Fix BHI documentation
x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
x86/bugs: Fix return type of spectre_bhi_state()
x86/apic: Force native_apic_mem_read() to use the MOV instruction
Merge tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fix from Ingo Molnar:
"Fix a PREEMPT_RT build bug"
* tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y
Merge tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
"Fix a bug in the GIC irqchip driver"
* tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gic-v3-its: Fix VSYNC referencing an unmapped VPE on GIC v4.1
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio bugfixes from Michael Tsirkin:
"Some small, obvious (in hindsight) bugfixes:
- new ioctl in vhost-vdpa has a wrong # - not too late to fix
- vhost has apparently been lacking an smp_rmb() - due to code
duplication :( The duplication will be fixed in the next merge
cycle, this is a minimal fix
- an error message in vhost talks about guest moving used index -
which of course never happens, guest only ever moves the available
index
- i2c-virtio didn't set the driver owner so it did not get refcounted
correctly"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: correct misleading printing information
vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE
virtio: store owner from modules with register_virtio_driver()
vhost: Add smp_rmb() in vhost_enable_notify()
vhost: Add smp_rmb() in vhost_vq_avail_empty()
Merge tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- fix up swiotlb buffer padding even more (Petr Tesarik)
- fix for partial dma_sync on swiotlb (Michael Kelley)
- swiotlb debugfs fix (Dexuan Cui)
* tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files()
swiotlb: fix swiotlb_bounce() to do partial sync's correctly
swiotlb: extend buffer pre-padding to alloc_align_mask if necessary
Amir Goldstein [Fri, 5 Apr 2024 14:56:35 +0000 (17:56 +0300)]
kernfs: annotate different lockdep class for of->mutex of writable files
The writable file /sys/power/resume may call vfs lookup helpers for
arbitrary paths and readonly files can be read by overlayfs from vfs
helpers when sysfs is a lower layer of overlayfs.
To avoid a lockdep warning of circular dependency between overlayfs
inode lock and kernfs of->mutex, use a different lockdep class for
writable and readonly kernfs files.
Merge tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Pull ata fixes from Damien Le Moal:
- Add the mask_port_map parameter to the ahci driver. This is a
follow-up to the recent snafu with the ASMedia controller and its
virtual port hiding port-multiplier devices. As ASMedia confirmed
that there is no way to determine if these slow-to-probe virtual
ports are actually representing the ports of a port-multiplier
device, this new parameter allows masking ports to significantly
speed up probing during system boot, resulting in shorter boot times.
- A fix for an incorrect handling of a port unlock in
ata_scsi_dev_rescan().
- Allow command duration limits to be detected for ACS-4 devices, as
there are such devices out in the field.
* tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: libata-core: Allow command duration limits detection for ACS-4 drives
ata: libata-scsi: Fix ata_scsi_dev_rescan() error path
ata: ahci: Add mask_port_map module parameter
Merge tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix for oops in cifs_get_fattr of deleted files
- fix for the remote open counter going negative in some directory
lease cases
- fix for mkfifo to instantiate dentry to avoid possible crash
- important fix to allow handling key rotation for mount and remount
(ie cases that are becoming more common when password that was used
for the mount will expire soon but will be replaced by new password)
* tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb3: fix broken reconnect when password changing on the server by allowing password rotation
smb: client: instantiate when creating SFU files
smb3: fix Open files on server counter going negative
smb: client: fix NULL ptr deref in cifs_mark_open_handles_for_deleted_file()
Igor Pylypiv [Thu, 11 Apr 2024 20:12:24 +0000 (20:12 +0000)]
ata: libata-core: Allow command duration limits detection for ACS-4 drives
Even though the command duration limits (CDL) feature was first added
in ACS-5 (major version 12), there are some ACS-4 (major version 11)
drives that implement CDL as well.
IDENTIFY_DEVICE, SUPPORTED_CAPABILITIES, and CURRENT_SETTINGS log pages
are mandatory in the ACS-4 standard so it should be safe to read these
log pages on older drives implementing the ACS-4 standard.
Fixes: 62e4a60e0cdb ("scsi: ata: libata: Detect support for command duration limits")
Cc: [email protected]
Signed-off-by: Igor Pylypiv <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>
Commit 0c76106cb975 ("scsi: sd: Fix TCG OPAL unlock on system resume")
incorrectly handles failures of scsi_resume_device() in
ata_scsi_dev_rescan(), leading to a double call to
spin_unlock_irqrestore() to unlock a device port. Fix this by redefining
the goto labels used in case of errors and only unlock the port
scsi_scan_mutex when scsi_resume_device() fails.
Bug found with the Smatch static checker warning:
drivers/ata/libata-scsi.c:4774 ata_scsi_dev_rescan()
error: double unlocked 'ap->lock' (orig line 4757)
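The shape of such a fix can be illustrated with a toy lock that counts misuse. Everything here is hypothetical (`demo_lock`, `demo_rescan`); it only models the label layout described above and is not the libata code.

```c
#include <stdbool.h>

/* Toy lock that records misuse instead of crashing; purely illustrative. */
struct demo_lock {
	bool held;
	int  double_unlocks;
};

static void demo_lock_acquire(struct demo_lock *l)
{
	l->held = true;
}

static void demo_lock_release(struct demo_lock *l)
{
	if (!l->held)
		l->double_unlocks++;
	l->held = false;
}

/*
 * The port lock is dropped around a call that may fail. The error label
 * must release only the outer mutex, because the port lock is already
 * released at the point of failure.
 */
static int demo_rescan(struct demo_lock *scan_mutex, struct demo_lock *port,
		       bool resume_fails)
{
	int ret = 0;

	demo_lock_acquire(scan_mutex);
	demo_lock_acquire(port);

	demo_lock_release(port);	/* drop the port lock for the slow call */
	if (resume_fails) {
		ret = -1;
		goto unlock_scan;	/* port lock is NOT held here */
	}
	demo_lock_acquire(port);	/* retake it on the success path */

	demo_lock_release(port);
unlock_scan:
	demo_lock_release(scan_mutex);
	return ret;
}
```

Had the failure path jumped to a label placed before the final port release, the toy lock would count a double unlock, which is the class of bug the static checker reported.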
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Fix the TLBI RANGE operand calculation causing live migration under
KVM/arm64 to miss dirty pages due to stale TLB entries"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: tlb: Fix TLBI RANGE operand
Merge tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"The device tree changes this time are all for NXP i.MX platforms,
addressing issues with clocks and regulators on i.MX7 and i.MX8.
The old OMAP2 based Nokia N8x0 tablet get a couple of code fixes for
regressions that came in.
The ARM SCMI and FF-A firmware interfaces get a couple of minor bug
fixes.
A regression fix for RISC-V cache management addresses a problem with
probe order on Sifive cores"
* tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (23 commits)
MAINTAINERS: Change Krzysztof Kozlowski's email address
arm64: dts: imx8qm-ss-dma: fix can lpcg indices
arm64: dts: imx8-ss-dma: fix can lpcg indices
arm64: dts: imx8-ss-dma: fix adc lpcg indices
arm64: dts: imx8-ss-dma: fix pwm lpcg indices
arm64: dts: imx8-ss-dma: fix spi lpcg indices
arm64: dts: imx8-ss-conn: fix usb lpcg indices
arm64: dts: imx8-ss-lsio: fix pwm lpcg indices
ARM: dts: imx7s-warp: Pass OV2680 link-frequencies
ARM: dts: imx7-mba7: Use 'no-mmc' property
arm64: dts: imx8-ss-conn: fix usdhc wrong lpcg clock order
arm64: dts: freescale: imx8mp-venice-gw73xx-2x: fix USB vbus regulator
arm64: dts: freescale: imx8mp-venice-gw72xx-2x: fix USB vbus regulator
cache: sifive_ccache: Partially convert to a platform driver
firmware: arm_scmi: Make raw debugfs entries non-seekable
firmware: arm_scmi: Fix wrong fastchannel initialization
firmware: arm_ffa: Fix the partition ID check in ffa_notification_info_get()
ARM: OMAP2+: fix USB regression on Nokia N8x0
mmc: omap: restore original power up/down steps
mmc: omap: fix deferred probe
...
Merge tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:
- Intel VT-d Fixes:
- Allocate local memory for PRQ page
- Fix WARN_ON in iommu probe path
- Fix wrong use of pasid config
- AMD IOMMU Fixes:
- Lock inversion fix
- Log message severity fix
- Disable SNP when v2 page-tables are used
- Mediatek driver:
- Fix module autoloading
* tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/amd: Change log message severity
iommu/vt-d: Fix WARN_ON in iommu probe path
iommu/vt-d: Allocate local memory for page request queue
iommu/vt-d: Fix wrong use of pasid config
iommu: mtk: fix module autoloading
iommu/amd: Do not enable SNP when V2 page table is enabled
iommu/amd: Fix possible irq lock inversion dependency issue
Merge tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Revert a quirk that prevented Secondary Bus Reset for LSI / Agere
FW643.
We thought the device was broken, but the reset does work correctly
on other platforms, and the reset avoids leaking data out of VMs
(Bjorn Helgaas)
- Update MAINTAINERS to reflect that Gustavo Pimentel is no longer
reachable (Manivannan Sadhasivam)
* tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
Revert "PCI: Mark LSI FW643 to avoid bus reset"
MAINTAINERS: Drop Gustavo Pimentel as PCI DWC Maintainer
* tag 'block-6.9-20240412' of git://git.kernel.dk/linux:
block: fix that blk_time_get_ns() doesn't update time after schedule
block: allow device to have both virt_boundary_mask and max segment size
block: fix q->blkg_list corruption during disk rebind
blk-iocost: avoid out of bounds shift
raid1: fix use-after-free for original bio in raid1_write_request()
Merge tag 'io_uring-6.9-20240412' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Fix for sigmask restoring while waiting for events (Alexey)
- Typo fix in comment (Haiyue)
- Fix for a msg_control restore on SEND_ZC retries (Pavel)
* tag 'io_uring-6.9-20240412' of git://git.kernel.dk/linux:
io-uring: correct typo in comment for IOU_F_TWQ_LAZY_WAKE
io_uring/net: restore msg_control on sendzc retry
io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failure
Merge tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"Two CephFS fixes marked for stable and a MAINTAINERS update"
* tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client:
MAINTAINERS: remove myself as a Reviewer for Ceph
ceph: switch to use cap_delay_lock for the unlink delay list
ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE
Commit d96c36004e31 ("tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig
entry") removed a hidden tab because it apparently showed breakage in
some third-party kernel config parsing tool.
It wasn't clear what tool it was, but let's make sure it gets fixed.
Because if you can't parse tabs as whitespace, you should not be parsing
the kernel Kconfig files.
In fact, let's make such breakage more obvious than some esoteric ftrace
record size option. If you can't parse tabs, you can't have page sizes.
Yes, tab-vs-space confusion is sadly a traditional Unix thing, and
'make' is famous for being broken in this regard. But no, that does not
mean that it's ok.
I'd add more random tabs to our Kconfig files, but I don't want to make
things uglier than necessary. But it *might* be necessary if it turns
out we see more of this kind of silly tooling.
Merge tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix the buffer_percent accounting as it is dependent on three
variables:
1) pages_read - number of subbuffers read
2) pages_lost - number of subbuffers lost due to overwrite
3) pages_touched - number of pages that a writer entered
These three counters only increment, and to know how many active
pages there are on the buffer at any given time, the pages_read and
pages_lost are subtracted from pages_touched.
But pages_touched was incremented whenever any writer went to the
next subbuffer, even if it wasn't the only one, so it was incremented
more than it should have been. That made the count of subbuffers that
currently have content incorrect, which in turn caused the
buffer_percent logic, which holds waiters until the ring buffer is
filled to a given percentage, to wake them up early.
- Fix warning of unused functions when PERF_EVENTS is not configured in
- Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE
- Fix to some kerneldoc function comments in eventfs code.
* tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Only update pages_touched when a new page is touched
tracing: hide unused ftrace_event_id_fops
tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry
eventfs: Fix kernel-doc comments to functions
Merge tag 'drm-fixes-2024-04-12' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Looks like everyone woke up after holidays, this week's pull has a
bunch of stuff all over, 2 weeks' worth of amdgpu is a lot of it, then
i915/xe have a few, a bunch of msm fixes, then some scattered driver
fixes.
I expect things will settle down for rc5.
client:
- Protect connector modes with mode_config mutex
ast:
- Fix soft lockup
host1x:
- Do not setup DMA for virtual addresses
ivpu:
- Fix deadlock in context_xa
- PCI fixes
- Fixes to error handling
nouveau:
- gsp: Fix OOB access
- Fix casting
panfrost:
- Fix error path in MMU code
qxl:
- Revert "drm/qxl: simplify qxl_fence_wait"
vmwgfx:
- Enable DMA for SEV mappings
i915:
- Couple CDCLK programming fixes
- HDCP related fix
- 4 Bigjoiner related fixes
- Fix for a circular locking around GuC on reset+wedged case
msm:
- DP refcount leak fix on disconnect
- Add missing newlines to prints in msm_fb and msm_kms
- fix dpu debugfs entry permissions
- Fix the interface table for the catalog of X1E80100
- fix irq message printing
- Bindings fix to add DP node as child of mdss for mdss node
- Minor typo fix in DP driver API which handles port status change
- fix CRASHDUMP_READ()
- fix HBB (highest bank bit) for a619 to fix UBWC corruption
* tag 'drm-fixes-2024-04-12' of https://gitlab.freedesktop.org/drm/kernel: (65 commits)
amdkfd: use calloc instead of kzalloc to avoid integer overflow
drm/xe: Label RING_CONTEXT_CONTROL as masked
drm/xe/xe_migrate: Cast to output precision before multiplying operands
drm/xe/hwmon: Cast result to output precision on left shift of operand
drm/xe/display: Fix double mutex initialization
drm/amdgpu: differentiate external rev id for gfx 11.5.0
drm/amd/display: Adjust dprefclk by down spread percentage.
drm/amd/display: Set VSC SDP Colorimetry same way for MST and SST
drm/amd/display: Program VSC SDP colorimetry for all DP sinks >= 1.4
drm/amd/display: fix disable otg wa logic in DCN316
drm/amd/display: Do not recursively call manual trigger programming
drm/amd/display: always reset ODM mode in context when adding first plane
drm/amdgpu: fix incorrect number of active RBs for gfx11
drm/amd/display: Return max resolution supported by DWB
amd/amdkfd: sync all devices to wait all processes being evicted
drm/amdgpu: clear set_q_mode_offs when VM changed
drm/amdgpu: Fix VCN allocation in CPX partition
drm/amd/pm: fix the high voltage issue after unload
drm/amd/display: Skip on writeback when it's not applicable
drm/amdgpu: implement IRQ_STATE_ENABLE for SDMA v4.4.2
...
block: fix that blk_time_get_ns() doesn't update time after schedule
While monitoring the throttle time of IO from iocost, it's found that
such time is always zero after the io_schedule() from ioc_rqos_throttle,
for example, with the following debug patch:
+ printk("%s-%d: %s enter %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
while (true) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (wait.committed)
break;
io_schedule();
}
+ printk("%s-%d: %s exit %llu\n", current->comm, current->pid, __func__, blk_time_get_ns());
It can be observed that blk_time_get_ns() always returns the same time:
And I think the root cause is that 'PF_BLOCK_TS' is always cleared
by blk_flush_plug() before schedule(), hence blk_plug_invalidate_ts()
will never be called:
io_schedule:
io_schedule_prepare
blk_flush_plug
__blk_flush_plug
/* the flag is cleared, while time is not */
current->flags &= ~PF_BLOCK_TS;
schedule
sched_update_worker
/* the flag is not set, hence plug->cur_ktime is not cleared */
if (tsk->flags & PF_BLOCK_TS)
blk_plug_invalidate_ts()
blk_time_get_ns
/* got the time stashed before schedule */
return plug->cur_ktime;
Fix the problem by clearing cached time in __blk_flush_plug().
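The call chain above can be modeled in user space. The following is a
minimal sketch, not the kernel code: struct task_model, the model_*
helpers, the PF_BLOCK_TS value and the fake clock are all hypothetical
stand-ins chosen for illustration. It only shows why clearing the flag
without clearing the cached time leaves a stale timestamp behind.

```c
#include <assert.h>
#include <stdint.h>

#define PF_BLOCK_TS 0x1u	/* stand-in bit; not the kernel's real value */

/* Toy model of a task: flags, the plug's cached time, and a fake clock. */
struct task_model {
	unsigned int flags;
	uint64_t cur_ktime;	/* 0 means "nothing cached" */
	uint64_t clock_ns;	/* fake monotonic clock, must stay nonzero */
};

/* Return the cached time, reading the fake clock only on a cache miss. */
static uint64_t model_time_get_ns(struct task_model *t)
{
	assert(t->clock_ns != 0);
	if (!t->cur_ktime) {
		t->cur_ktime = t->clock_ns;
		t->flags |= PF_BLOCK_TS;
	}
	return t->cur_ktime;
}

/* Flush step: always clears the flag; clears the cached time only with the fix. */
static void model_flush_plug(struct task_model *t, int with_fix)
{
	if (with_fix)
		t->cur_ktime = 0;
	t->flags &= ~PF_BLOCK_TS;
}

/* Sleep for slept_ns, then invalidate the cache only if the flag is still set. */
static void model_schedule(struct task_model *t, uint64_t slept_ns)
{
	t->clock_ns += slept_ns;
	if (t->flags & PF_BLOCK_TS) {
		t->cur_ktime = 0;
		t->flags &= ~PF_BLOCK_TS;
	}
}

/* Time delta a caller observes across a flush followed by a 500ns sleep. */
uint64_t observed_sleep_ns(int with_fix)
{
	struct task_model t = { .flags = 0, .cur_ktime = 0, .clock_ns = 1000 };
	uint64_t before = model_time_get_ns(&t);

	model_flush_plug(&t, with_fix);
	model_schedule(&t, 500);
	return model_time_get_ns(&t) - before;
}
```

In this model observed_sleep_ns(0) returns 0, matching the symptom
described above, while observed_sleep_ns(1) returns 500 because the
flush step also drops the cached time.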
John Stultz [Wed, 10 Apr 2024 23:26:30 +0000 (16:26 -0700)]
selftests: timers: Fix abs() warning in posix_timers test
Building with clang results in the following warning:
posix_timers.c:69:6: warning: absolute value function 'abs' given an
argument of type 'long long' but has parameter of type 'int' which may
cause truncation of value [-Wabsolute-value]
if (abs(diff - DELAY * USECS_PER_SEC) > USECS_PER_SEC / 2) {
^
So switch to using llabs() instead.
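As a rough illustration of why the warning matters, here is a small
sketch of the same comparison written with llabs(). The function name
within_half_second() and its parameters are made up for this example;
only the USECS_PER_SEC comparison is taken from the warning text above.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define USECS_PER_SEC 1000000LL

/* Return 1 if elapsed_usec is within half a second of delay_sec seconds. */
int within_half_second(long long elapsed_usec, long long delay_sec)
{
	long long diff = elapsed_usec - delay_sec * USECS_PER_SEC;

	/* llabs(LLONG_MIN) is undefined, so rule it out explicitly. */
	assert(diff != LLONG_MIN);

	/*
	 * llabs() takes and returns long long. abs() would first convert
	 * diff to int, so a difference of exactly 2^32 microseconds would
	 * typically be truncated to 0 and wrongly look "in range".
	 */
	return llabs(diff) <= USECS_PER_SEC / 2;
}
```

With a 2 second delay, an elapsed time of 2400000 usec is accepted,
while one that is off by 4294967296 usec (2^32) is rejected.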
selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
After commit 6d029c25b71f ("selftests/timers/posix_timers: Reimplement
check_timer_distribution()"), clang warns:
tools/testing/selftests/timers/../kselftest.h:398:6: warning: variable 'major' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
398 | if (uname(&info) || sscanf(info.release, "%u.%u.", &major, &minor) != 2)
| ^~~~~~~~~~~~
tools/testing/selftests/timers/../kselftest.h:401:9: note: uninitialized use occurs here
401 | return major > min_major || (major == min_major && minor >= min_minor);
| ^~~~~
tools/testing/selftests/timers/../kselftest.h:398:6: note: remove the '||' if its condition is always false
398 | if (uname(&info) || sscanf(info.release, "%u.%u.", &major, &minor) != 2)
| ^~~~~~~~~~~~~~~
tools/testing/selftests/timers/../kselftest.h:395:20: note: initialize the variable 'major' to silence this warning
395 | unsigned int major, minor;
| ^
| = 0
This is a false positive because if uname() fails, ksft_exit_fail_msg()
will be called, which unconditionally calls exit(), a noreturn function.
However, clang does not know that ksft_exit_fail_msg() will call exit() at
the point in the pipeline that the warning is emitted because inlining has
not occurred, so it assumes control flow will resume normally after
ksft_exit_fail_msg() is called.
Make it clear to clang that all of the functions that call exit()
unconditionally in kselftest.h are noreturn transitively by marking them
explicitly with '__attribute__((__noreturn__))', which clears up the
warning above and any future warnings that may appear for the same reason.
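The pattern can be sketched outside of kselftest.h as follows. This is
an illustration only: die_msg() and release_at_least() are hypothetical
names loosely modeled on the code quoted in the warning, not the actual
selftest helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for ksft_exit_fail_msg(): it never returns. */
__attribute__((__noreturn__))
static void die_msg(const char *msg)
{
	fprintf(stderr, "%s\n", msg);
	exit(1);
}

/* Return nonzero if the "major.minor." prefix of release is new enough. */
int release_at_least(const char *release, unsigned int min_major,
		     unsigned int min_minor)
{
	unsigned int major, minor;

	assert(release != NULL);

	/*
	 * Because die_msg() is marked noreturn, the compiler knows that
	 * major and minor are always initialized when the return statement
	 * below is reached, even before any inlining takes place.
	 */
	if (sscanf(release, "%u.%u.", &major, &minor) != 2)
		die_msg("cannot parse release string");

	return major > min_major || (major == min_major && minor >= min_minor);
}
```

For example, release_at_least("6.9.0", 6, 8) is nonzero, while
release_at_least("5.4.0", 6, 0) is zero.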
After commit 6d029c25b71f ("selftests/timers/posix_timers: Reimplement
check_timer_distribution()") the following warning occurs when building
with an older gcc:
posix_timers.c:250:2: warning: format not a string literal and no format arguments [-Wformat-security]
250 | ksft_print_msg(errmsg);
| ^~~~~~~~~~~~~~
Fix this up by changing it to ksft_print_msg("%s", errmsg)
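A small sketch of the difference, using a hypothetical print_msg() in
place of ksft_print_msg(). It records the message in a buffer (last_msg,
also made up for this example) instead of printing it, so the result is
easy to inspect.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

static char last_msg[128];	/* captures the most recent message */

/* Hypothetical printf-style logger; the attribute enables format checks. */
__attribute__((format(printf, 1, 2)))
static int print_msg(const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(fmt != NULL);
	va_start(ap, fmt);
	n = vsnprintf(last_msg, sizeof(last_msg), fmt, ap);
	va_end(ap);
	return n;
}

/* Log an arbitrary, possibly untrusted, error string. */
int report_error(const char *errmsg)
{
	/*
	 * print_msg(errmsg) would interpret any '%' in errmsg as a
	 * conversion and is what -Wformat-security complains about.
	 * Passing it as an argument to "%s" copies it verbatim.
	 */
	return print_msg("%s", errmsg);
}
```

Calling report_error("100% done %s") stores the 12 characters unchanged,
including the '%' signs.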
Lu Baolu [Thu, 11 Apr 2024 03:07:44 +0000 (11:07 +0800)]
iommu/vt-d: Fix WARN_ON in iommu probe path
Commit 1a75cc710b95 ("iommu/vt-d: Use rbtree to track iommu probed
devices") adds all devices probed by the iommu driver in a rbtree
indexed by the source ID of each device. It assumes that each device
has a unique source ID. This assumption is incorrect and the VT-d
spec doesn't state this requirement either.
The reason for using a rbtree to track devices is to look up the device
with PCI bus and devfunc in the paths of handling ATS invalidation time
out error and the PRI I/O page faults. Both are PCI ATS feature related.
Only track the devices that have PCI ATS capabilities in the rbtree to
avoid unnecessary WARN_ON in the iommu probe path. Otherwise, on some
platforms a kernel splat is displayed and the iommu probe fails.
Thomas Gleixner [Wed, 10 Apr 2024 19:45:28 +0000 (21:45 +0200)]
x86/cpu/amd: Make the NODEID_MSR union actually work
A system with NODEID_MSR was reported to crash during early boot without
any output.
The reason is that the union which is used for accessing the bitfields in
the MSR is written wrongly and the resulting executable code accesses the
wrong part of the MSR data.
As a consequence a later division by that value results in 0 and that
result is used for another division as divisor, which obviously does not
work well.
The magic world of C, unions and bitfields:
union {
u64 bita : 3,
bitb : 3;
u64 all;
} x;
x.all = foo();
a = x.bita;
b = x.bitb;
results in the effective executable code of:
a = b = x.bita;
because bita and bitb are treated as union members and therefore both end
up at bit offset 0.
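One conventional way to get the intended layout is to wrap the bitfields
in a struct inside the union. The sketch below contrasts the two forms.
The type and function names are invented for this example, the field
names follow the snippet above, and the expected values assume the usual
GCC/clang bitfield layout on a little-endian target such as x86-64.

```c
#include <assert.h>
#include <stdint.h>

/* Layout from the example above: bita and bitb both start at bit 0. */
union msr_broken {
	uint64_t bita : 3,
		 bitb : 3;
	uint64_t all;
};

/* Bitfields wrapped in an anonymous struct are laid out one after another. */
union msr_fixed {
	struct {
		uint64_t bita : 3,
			 bitb : 3;
	};
	uint64_t all;
};

static_assert(sizeof(union msr_fixed) == sizeof(uint64_t),
	      "bitfields must overlay the raw 64-bit value");

/* Read bitb through the broken layout: this really returns bits 0-2. */
unsigned int broken_bitb(uint64_t raw)
{
	union msr_broken x;

	x.all = raw;
	return x.bitb;
}

/* Read bitb through the fixed layout: bits 3-5 on the assumed ABI. */
unsigned int fixed_bitb(uint64_t raw)
{
	union msr_fixed x;

	x.all = raw;
	return x.bitb;
}
```

For the raw value 0x2a (binary 101010), broken_bitb() yields 2, the low
three bits, whereas fixed_bitb() yields 5, the next three bits.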
Thomas Gleixner [Wed, 10 Apr 2024 19:45:27 +0000 (21:45 +0200)]
x86/cpu/amd: Make the CPUID 0x80000008 parser correct
CPUID 0x80000008 ECX.cpu_nthreads describes the number of threads in the
package. The parser uses this value to initialize the SMT domain level.
That's wrong because cpu_nthreads does not describe the number of threads
per physical core. So this needs to set the CORE domain level and let the
later parsers set the SMT shift if available.
Preset the SMT domain level with the assumption of one thread per core,
which is correct if there are no other CPUID leafs to parse, and propagate
cpu_nthreads and the core level APIC bitwidth into the CORE domain.
x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
For consistency with the other CONFIG_MITIGATION_* options, replace the
CONFIG_SPECTRE_BHI_{ON,OFF} options with a single
CONFIG_MITIGATION_SPECTRE_BHI option.
iommu/amd: Do not enable SNP when V2 page table is enabled
DTE[Mode]=0 is not supported when SNP is enabled in the host. That means
that to support SNP, the IOMMU must be configured with the V1 page table
(see the IOMMU spec [1] for details). If the user passes the kernel command
line option that configures IOMMU domains with the v2 page table
(amd_iommu=pgtbl_v2), then do as the user asked: disable SNP instead of
forcing the page table to v1.
Dave Airlie [Fri, 12 Apr 2024 01:01:44 +0000 (11:01 +1000)]
Merge tag 'drm-msm-next-2024-04-11' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
Fixes for v6.9
Display:
- Fixes for PM refcount leak when DP goes to disconnected state and
also when link training fails. This is also one of the issues found
with the pm runtime series
- Add missing newlines to prints in msm_fb and msm_kms
- Change permissions of some dpu debugfs entries which write to const
data from catalog to read-only to avoid protection faults
- Fix the interface table for the catalog of X1E80100. This is an
important fix to bringup DP for X1E80100.
- Logging fix to print the callback symbol in the invalid IRQ message
case rather than printing when it's known to be NULL.
- Bindings fix to add DP node as child of mdss for mdss node
- Minor typo fix in DP driver API which handles port status change
GPU:
- fix CRASHDUMP_READ()
- fix HBB (highest bank bit) for a619 to fix UBWC corruption
Merge tag 'cxl-fixes-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull cxl fixes from Dave Jiang:
- Fix index of Clear Event Record handles in cxl_clear_event_record()
- Fix use before init of map->reg_type in cxl_decode_regblock()
- Fix initialization of mbox_cmd.size_out in cxl_mem_get_records_log()
- Fix CXL path access_coordinate computation:
- Remove unneeded check of iter in loop
- Fix retrieving of access_coordinate in PCI topology walk
- Fix incorrect region access_coordinate data calculation
- Consolidate access_coordinates attached to downstream port
context
- Add check to validate access_coordinate validity to prevent
incorrect data being exposed via sysfs
* tag 'cxl-fixes-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
cxl: Add checks to access_coordinate calculation to fail missing data
cxl: Consolidate dport access_coordinate ->hb_coord and ->sw_coord into ->coord
cxl: Fix incorrect region perf data calculation
cxl: Fix retrieving of access_coordinates in PCIe path
cxl: Remove checking of iter in cxl_endpoint_get_perf_coordinates()
cxl/core: Fix initialization of mbox_cmd.size_out in get event
cxl/core/regs: Fix usage of map->reg_type in cxl_decode_regblock() before assigned
cxl/mem: Fix for the index of Clear Event Record Handle