]> Git Repo - linux.git/log
linux.git
11 months agoMerge tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux...
Chandan Babu R [Tue, 16 Apr 2024 06:58:57 +0000 (12:28 +0530)]
Merge tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: online repair of symbolic links

The patches in this set adds the ability to repair the target buffer of
a symbolic link, using the same salvage, rebuild, and swap strategy used
everywhere else.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'repair-symlink-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: online repair of symbolic links
  xfs: pass the owner to xfs_symlink_write_target
  xfs: expose xfs_bmap_local_to_extents for online repair

11 months agoMerge tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux...
Chandan Babu R [Tue, 16 Apr 2024 06:48:18 +0000 (12:18 +0530)]
Merge tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: move orphan files to lost and found

Orphaned files are defined to be files with nonzero ondisk link count
but no observable parent directory.  This series enables online repair
to reparent orphaned files into the filesystem directory tree, and wires
up this reparenting ability into the directory, file link count, and
parent pointer repair functions.  This is how we fix files with positive
link count that are not reachable through the directory tree.

This patch will also create the orphanage directory (lost+found) if it
is not present.  In contrast to xfs_repair, we follow e2fsck in creating
the lost+found without group or other-owner access to avoid accidental
disclosure of files that were previously hidden by an 0700 directory.
That's silly security, but people have been known to do it.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'repair-orphanage-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: ensure dentry consistency when the orphanage adopts a file
  xfs: move files to orphanage instead of letting nlinks drop to zero
  xfs: move orphan files to the orphanage

11 months agoMerge tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kerne...
Chandan Babu R [Tue, 16 Apr 2024 06:31:06 +0000 (12:01 +0530)]
Merge tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: online repair of directories

This series employs atomic extent swapping to enable safe reconstruction
of directory data.  For now, XFS does not support reverse directory
links (aka parent pointers), so we can only salvage the dirents of a
directory and construct a new structure.

Directory repair therefore consists of five main parts:

First, we walk the existing directory to salvage as many entries as we
can, by adding them as new directory entries to the repair temp dir.

Second, we validate the parent pointer found in the directory.  If one
was not found, we scan the entire filesystem looking for a potential
parent.

Third, we use atomic extent swaps to exchange the entire data fork
between the two directories.

Fourth, we reap the old directory blocks as carefully as we can.

To wrap up the directory repair code, we need to add to the regular
filesystem the ability to free all the data fork blocks in a directory.
This does not change anything with normal directories, since they must
still unlink and shrink one entry at a time.  However, this will
facilitate freeing of partially-inactivated temporary directories during
log recovery.

The second half of this patchset implements repairs for the dotdot
entries of directories.  For now there is only rudimentary support for
this, because there are no directory parent pointers, so the best we can
do is scanning the filesystem and the VFS dcache for answers.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'repair-dirs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: ask the dentry cache if it knows the parent of a directory
  xfs: online repair of parent pointers
  xfs: scan the filesystem to repair a directory dotdot entry
  xfs: online repair of directories
  xfs: inactivate directory data blocks

11 months agoMerge tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org...
Chandan Babu R [Tue, 16 Apr 2024 06:27:10 +0000 (11:57 +0530)]
Merge tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: online repair of inode unlinked state

This series adds some logic to the inode scrubbers so that they can
detect and deal with consistency errors between the link count and the
per-inode unlinked list state.  The helpers needed to do this are
presented here because they are a prequisite for rebuildng directories,
since we need to get a rebuilt non-empty directory off the unlinked
list.

Note that this patchset does not provide comprehensive reconstruction of
the AGI unlinked list; that is coming in a subsequent patchset.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'repair-unlinked-inode-state-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: update the unlinked list when repairing link counts
  xfs: ensure unlinked list state is consistent with nlink during scrub

11 months agoMerge tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux...
Chandan Babu R [Tue, 16 Apr 2024 06:23:09 +0000 (11:53 +0530)]
Merge tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: online repair of extended attributes

This series employs atomic extent swapping to enable safe reconstruction
of extended attribute data attached to a file.  Because xattrs do not
have any redundant information to draw off of, we can at best salvage
as much data as we can and build a new structure.

Rebuilding an extended attribute structure consists of these three
steps:

First, we walk the existing attributes to salvage as many of them as we
can, by adding them as new attributes attached to the repair tempfile.
We need to add a new xfile-based data structure to hold blobs of
arbitrary length to stage the xattr names and values.

Second, we write the salvaged attributes to a temporary file, and use
atomic extent swaps to exchange the entire attribute fork between the
two files.

Finally, we reap the old xattr blocks (which are now in the temporary
file) as carefully as we can.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'repair-xattrs-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: create an xattr iteration function for scrub
  xfs: flag empty xattr leaf blocks for optimization
  xfs: scrub should set preen if attr leaf has holes
  xfs: repair extended attributes
  xfs: use atomic extent swapping to fix user file fork data
  xfs: create a blob array data structure
  xfs: enable discarding of folios backing an xfile

11 months agoMerge tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub...
Chandan Babu R [Tue, 16 Apr 2024 06:18:46 +0000 (11:48 +0530)]
Merge tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: set and validate dir/attr block owners

There are a couple of significant changes that need to be made to the
directory and xattr code before we can support online repairs of those
data structures.

The first change is because online repair is designed to use libxfs to
create a replacement dir/xattr structure in a temporary file, and use
atomic extent swapping to commit the corrected structure.  To avoid the
performance hit of walking every block of the new structure to rewrite
the owner number before the swap, we instead change libxfs to allow
callers of the dir and xattr code the ability to set an explicit owner
number to be written into the header fields of any new blocks that are
created.  For regular operation this will be the directory inode number.

The second change is to update the dir/xattr code to actually *check*
the owner number in each block that is read off the disk, since we don't
currently do that.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'dirattr-validate-owners-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: validate explicit directory free block owners
  xfs: validate explicit directory block buffer owners
  xfs: validate explicit directory data buffer owners
  xfs: validate directory leaf buffer owners
  xfs: validate dabtree node buffer owners
  xfs: validate attr remote value buffer owners
  xfs: validate attr leaf buffer owners
  xfs: reduce indenting in xfs_attr_node_list
  xfs: use the xfs_da_args owner field to set new dir/attr block owner
  xfs: add an explicit owner field to xfs_da_args

11 months agoMerge tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux...
Chandan Babu R [Tue, 16 Apr 2024 06:15:33 +0000 (11:45 +0530)]
Merge tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: online repair of realtime summaries

We now have all the infrastructure we need to repair file metadata.
We'll begin with the realtime summary file, because it is the least
complex data structure.  To support this we need to add three more
pieces to the temporary file code from the previous patchset --
preallocating space in the temp file, formatting metadata into that
space and writing the blocks to disk, and swapping the fork mappings
atomically.

After that, the actual reconstruction of the realtime summary
information is pretty simple, since we can simply write the incore
copy computed by the rtsummary scrubber to the temporary file, swap the
contents, and reap the old blocks.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'repair-rtsummary-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: online repair of realtime summaries
  xfs: teach the tempfile to set up atomic file content exchanges
  xfs: support preallocating and copying content into temporary files

11 months agoMerge tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux...
Chandan Babu R [Tue, 16 Apr 2024 06:10:13 +0000 (11:40 +0530)]
Merge tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: create temporary files for online repair

As mentioned earlier, the repair strategy for file-based metadata is to
build a new copy in a temporary file and swap the file fork mappings
with the metadata inode.  We've built the atomic extent swap facility,
so now we need to build a facility for handling private temporary files.

The first step is to teach the filesystem to ignore the temporary files.
We'll mark them as PRIVATE in the VFS so that the kernel security
modules will leave it alone.  The second step is to add the online
repair code the ability to create a temporary file and reap extents from
the temporary file after the extent swap.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'repair-tempfiles-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: add the ability to reap entire inode forks
  xfs: refactor live buffer invalidation for repairs
  xfs: create temporary files and directories for online repair
  xfs: hide private inodes from bulkstat and handle functions

11 months agoMerge tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm...
Chandan Babu R [Tue, 16 Apr 2024 05:55:09 +0000 (11:25 +0530)]
Merge tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: atomic file content exchanges

This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.

This new functionality enables data storage programs to stage and commit
file updates such that reader programs will see either the old contents
or the new contents in their entirety, with no chance of torn writes.  A
successful call completion guarantees that the new contents will be seen
even if the system fails.

The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.  The
ioctls exist to facilitate testing of the new functionality and to
enable future application program designs.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchange the relevant ranges of the temp file
with the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Note that application software must quiesce writes to the file
while it stages an atomic update.  This will be addressed by a
subsequent series.

This mechanism solves the clunkiness of two existing atomic file update
mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
where other programs can see an empty file.  For create tempfile +
rename, the need to copy file attributes and extended attributes for
each file update is eliminated.

However, this method introduces its own awkwardness -- any program
initiating an exchange now needs to have a way to signal to other
programs that the file contents have changed.  For file access mediated
via read and write, fanotify or inotify are probably sufficient.  For
mmaped files, that may not be fast enough.

Here is the proposed manual page:

IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)

NAME
       ioctl_xfs_exchange_range  -  exchange  the contents of parts of
       two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs.h>

       int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct  xfs_ex‐
       change_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes the contents of the two ranges.

       Exchanges  are  atomic  with  regards to concurrent file opera‐
       tions.  Implementations must guarantee that readers see  either
       the old contents or the new contents in their entirety, even if
       the system fails.

       The system call parameters are conveyed in  structures  of  the
       following form:

           struct xfs_exchange_range {
               __s32    file1_fd;
               __u32    pad;
               __u64    file1_offset;
               __u64    file2_offset;
               __u64    length;
               __u64    flags;
           };

       The field pad must be zero.

       The  fields file1_fd, file1_offset, and length define the first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       Both  files must be from the same filesystem mount.  If the two
       file descriptors represent the same file, the byte ranges  must
       not  overlap.   Most  disk-based  filesystems  require that the
       starts of both ranges must be aligned to the file  block  size.
       If  this  is  the  case, the ends of the ranges must also be so
       aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.

       The field flags control the behavior of the exchange operation.

           XFS_EXCHANGE_RANGE_TO_EOF
                  Ignore the length parameter.  All bytes in  file1_fd
                  from  file1_offset to EOF are moved to file2_fd, and
                  file2's size is set to  (file2_offset+(file1_length-
                  file1_offset)).   Meanwhile, all bytes in file2 from
                  file2_offset to EOF are moved to file1  and  file1's
                  size    is   set   to   (file1_offset+(file2_length-
                  file2_offset)).

           XFS_EXCHANGE_RANGE_DSYNC
                  Ensure that all modified in-core data in  both  file
                  ranges  and  all  metadata updates pertaining to the
                  exchange operation are flushed to persistent storage
                  before  the  call  returns.  Opening either file de‐
                  scriptor with O_SYNC or O_DSYNC will have  the  same
                  effect.

           XFS_EXCHANGE_RANGE_FILE1_WRITTEN
                  Only  exchange sub-ranges of file1_fd that are known
                  to contain data  written  by  application  software.
                  Each  sub-range  may  be  expanded (both upwards and
                  downwards) to align with the file  allocation  unit.
                  For files on the data device, this is one filesystem
                  block.  For files on the realtime  device,  this  is
                  the realtime extent size.  This facility can be used
                  to implement fast atomic  scatter-gather  writes  of
                  any  complexity for software-defined storage targets
                  if all writes are aligned  to  the  file  allocation
                  unit.

           XFS_EXCHANGE_RANGE_DRY_RUN
                  Check  the parameters and the feasibility of the op‐
                  eration, but do not change anything.

RETURN VALUE
       On error, -1 is returned, and errno is set to indicate the  er‐
       ror.

ERRORS
       Error  codes can be one of, but are not limited to, the follow‐
       ing:

       EBADF  file1_fd is not open for reading and writing or is  open
              for  append-only  writes;  or  file2_fd  is not open for
              reading and writing or is open for append-only writes.

       EINVAL The parameters are not correct for  these  files.   This
              error  can  also appear if either file descriptor repre‐
              sents a device, FIFO, or socket.  Disk filesystems  gen‐
              erally  require  the  offset  and length arguments to be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The kernel was unable to allocate sufficient  memory  to
              perform the operation.

       ENOSPC There  is  not  enough  free space in the filesystem ex‐
              change the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd are immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd  and  file2_fd  are  not  on  the  same mounted
              filesystem.

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several use cases are imagined for this system  call.   In  all
       cases, application software must coordinate updates to the file
       because the exchange is performed unconditionally.

       The first is a data storage program that wants to  commit  non-
       contiguous  updates  to a file atomically and coordinates write
       access to that file.  This can be done by creating a  temporary
       file, calling FICLONE(2) to share the contents, and staging the
       updates into the temporary file.  The FULL_FILES flag is recom‐
       mended  for this purpose.  The temporary file can be deleted or
       punched out afterwards.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           struct xfs_exchange_range args = {
               .file1_fd = temp_fd,
               .flags = XFS_EXCHANGE_RANGE_TO_EOF,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

       The second is a software-defined  storage  host  (e.g.  a  disk
       jukebox)  which  implements an atomic scatter-gather write com‐
       mand.  Provided the exported disk's logical block size  matches
       the file's allocation unit size, this can be done by creating a
       temporary file and writing the data at the appropriate offsets.
       It  is  recommended that the temporary file be truncated to the
       size of the regular file before any writes are  staged  to  the
       temporary  file  to avoid issues with zeroing during EOF exten‐
       sion.  Use this call with the FILE1_WRITTEN  flag  to  exchange
       only  the  file  allocation  units involved in the emulated de‐
       vice's write command.  The temporary file should  be  truncated
       or  punched out completely before being reused to stage another
       write.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           int blksz;

           fstat(fd, &sb);
           blksz = sb.st_blksize;

           /* land scatter gather writes between 100fsb and 500fsb */
           pwrite(temp_fd, data1, blksz * 2, blksz * 100);
           pwrite(temp_fd, data2, blksz * 20, blksz * 480);
           pwrite(temp_fd, data3, blksz * 7, blksz * 257);

           /* commit the entire update */
           struct xfs_exchange_range args = {
               .file1_fd = temp_fd,
               .file1_offset = blksz * 100,
               .file2_offset = blksz * 100,
               .length       = blksz * 400,
               .flags        = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                               XFS_EXCHANGE_RANGE_FILE1_DSYNC,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

NOTES
       Some filesystems may limit the amount of data or the number  of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years.  It is also not
the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be committed atomically
into the inode being repaired.  This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.

For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: enable logged file mapping exchange feature
  docs: update swapext -> exchmaps language
  xfs: capture inode generation numbers in the ondisk exchmaps log item
  xfs: support non-power-of-two rtextsize with exchange-range
  xfs: make file range exchange support realtime files
  xfs: condense symbolic links after a mapping exchange operation
  xfs: condense directories after a mapping exchange operation
  xfs: condense extended attributes after a mapping exchange operation
  xfs: add error injection to test file mapping exchange recovery
  xfs: bind together the front and back ends of the file range exchange code
  xfs: create deferred log items for file mapping exchanges
  xfs: introduce a file mapping exchange log intent item
  xfs: create a incompat flag for atomic file mapping exchanges
  xfs: introduce new file range exchange ioctl
  vfs: export remap and write check helpers

11 months agoMerge tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org...
Chandan Babu R [Tue, 16 Apr 2024 05:47:28 +0000 (11:17 +0530)]
Merge tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: refactorings for atomic file content exchanges

This series applies various cleanups and refactorings to file IO
handling code ahead of the main series to implement atomic file content
exchanges.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'file-exchange-refactorings-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: constify xfs_bmap_is_written_extent
  xfs: refactor non-power-of-two alignment checks
  xfs: hoist multi-fsb allocation unit detection to a helper
  xfs: create a new helper to return a file's allocation unit
  xfs: declare xfs_file.c symbols in xfs_file.h
  xfs: move xfs_iops.c declarations out of xfs_inode.h
  xfs: move inode lease breaking functions to xfs_inode.c

11 months agoMerge tag 'log-incompat-permissions-6.10_2024-04-15' of https://git.kernel.org/pub...
Chandan Babu R [Tue, 16 Apr 2024 05:39:58 +0000 (11:09 +0530)]
Merge tag 'log-incompat-permissions-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

xfs: improve log incompat feature handling

This patchset improves the performance of log incompat feature bit
handling by making a few changes to how the filesystem handles them.
First, we now only clear the bits during a clean unmount to reduce calls
to the (expensive) upgrade function to once per bit per mount.  Second,
we now only allow incompat feature upgrades for sysadmins or if the
sysadmin explicitly allows it via mount option.  Currently the only log
incompat user is logged xattrs, which requires CONFIG_XFS_DEBUG=y, so
there should be no user visible impact to this change.

Signed-off-by: Darrick J. Wong <[email protected]>
Signed-off-by: Chandan Babu R <[email protected]>
* tag 'log-incompat-permissions-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: only clear log incompat flags at clean unmount
  xfs: fix error bailout in xrep_abt_build_new_trees
  xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory
  xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode
  xfs: pass xfs_buf lookup flags to xfs_*read_agi

11 months agoxfs: online repair of symbolic links
Darrick J. Wong [Mon, 15 Apr 2024 21:54:59 +0000 (14:54 -0700)]
xfs: online repair of symbolic links

If a symbolic link target looks bad, try to sift through the rubble to
find as much of the target buffer that we can, and stage a new target
(short or remote format as needed) in a temporary file and use the
atomic extent swapping mechanism to commit the results.  In the worst
case, we replace the target with an overly long filename that cannot
possibly resolve.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: ensure dentry consistency when the orphanage adopts a file
Darrick J. Wong [Mon, 15 Apr 2024 21:54:57 +0000 (14:54 -0700)]
xfs: ensure dentry consistency when the orphanage adopts a file

When the orphanage adopts a file, that file becomes a child of the
orphanage.  The dentry cache may have entries for the orphanage
directory and the name we've chosen, so (1) make sure we abort if the
dcache has a positive entry because something's not right; and (2)
invalidate and purge negative dentries if the adoption goes through.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: pass the owner to xfs_symlink_write_target
Darrick J. Wong [Mon, 15 Apr 2024 21:54:59 +0000 (14:54 -0700)]
xfs: pass the owner to xfs_symlink_write_target

Require callers of xfs_symlink_write_target to pass the owner number
explicitly.  This sets us up for online repair to be able to write a
remote symlink target to sc->tempip with sc->ip's inumber in the block
heaader.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: move files to orphanage instead of letting nlinks drop to zero
Darrick J. Wong [Mon, 15 Apr 2024 21:54:56 +0000 (14:54 -0700)]
xfs: move files to orphanage instead of letting nlinks drop to zero

If we encounter an inode with a nonzero link count but zero observed
links, move it to the orphanage.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: expose xfs_bmap_local_to_extents for online repair
Darrick J. Wong [Mon, 15 Apr 2024 21:54:58 +0000 (14:54 -0700)]
xfs: expose xfs_bmap_local_to_extents for online repair

Allow online repair to call xfs_bmap_local_to_extents and add a void *
argument at the end so that online repair can pass its own context.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: ask the dentry cache if it knows the parent of a directory
Darrick J. Wong [Mon, 15 Apr 2024 21:54:54 +0000 (14:54 -0700)]
xfs: ask the dentry cache if it knows the parent of a directory

It's possible that the dentry cache can tell us the parent of a
directory.  Therefore, when repairing directory dot dot entries, query
the dcache as a last resort before scanning the entire filesystem.

A reviewer asks:

"How high is the chance that we actually have a valid dcache entry for a
file in a corrupted directory?"

There's a decent chance of this actually working.  Say you have a
1000-block directory foo, and block 980 gets corrupted.  Let's further
suppose that block 0 has a correct entry for ".." and "bar".  If someone
accesses /mnt/foo/bar, that will cause the dcache to create a dentry
from /mnt to /mnt/foo whose d_parent points back to /mnt.  If you then
want to rebuild the directory, XFS can obtain the parent from the dcache
without needing to wander into parent pointers or scan the filesystem to
find /mnt's connection to foo.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: move orphan files to the orphanage
Darrick J. Wong [Mon, 15 Apr 2024 21:54:55 +0000 (14:54 -0700)]
xfs: move orphan files to the orphanage

When we're repairing a directory structure or fixing the dotdot entry of
a subdirectory, it's possible that we won't ever find a parent for the
subdirectory.  When this is the case, move it to the orphanage, aka
/lost+found.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: online repair of parent pointers
Darrick J. Wong [Mon, 15 Apr 2024 21:54:53 +0000 (14:54 -0700)]
xfs: online repair of parent pointers

Teach the online repair code to fix parent pointers for directories.
For now, this means correcting the dotdot entry of an existing directory
that is otherwise consistent.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: scan the filesystem to repair a directory dotdot entry
Darrick J. Wong [Mon, 15 Apr 2024 21:54:52 +0000 (14:54 -0700)]
xfs: scan the filesystem to repair a directory dotdot entry

Teach the online directory repair code to scan the filesystem so that we
can set the dotdot entry when we're rebuilding a directory.  This
involves dropping ILOCK on the directory that we're repairing, which
means that the VFS can sneak in and tell us to update dotdot at any
time.  Deal with these races by using a dirent hook to absorb dotdot
updates, and be careful not to check the scan results until after we've
retaken the ILOCK.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: update the unlinked list when repairing link counts
Darrick J. Wong [Mon, 15 Apr 2024 21:54:49 +0000 (14:54 -0700)]
xfs: update the unlinked list when repairing link counts

When we're repairing the link counts of a file, we must ensure either
that the file has zero link count and is on the unlinked list; or that
it has nonzero link count and is not on the unlinked list.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: online repair of directories
Darrick J. Wong [Mon, 15 Apr 2024 21:54:51 +0000 (14:54 -0700)]
xfs: online repair of directories

If a directory looks like it's in bad shape, try to sift through the
rubble to find whatever directory entries we can, scan the directory
tree for the parent (if needed), stage the new directory contents in a
temporary file and use the atomic extent swapping mechanism to commit
the results in bulk.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: inactivate directory data blocks
Darrick J. Wong [Mon, 15 Apr 2024 21:54:50 +0000 (14:54 -0700)]
xfs: inactivate directory data blocks

Teach inode inactivation to delete all the incore buffers backing a
directory.  In normal runtime this should never happen because the VFS
forbids rmdir on a non-empty directory.

In the next patch, online directory repair stands up a new directory,
exchanges it with the broken directory, and then drops the private
temporary directory.  If we cancel the repair just prior to exchanging
the directory contents, the new directory will need to be torn down.
Note: If we commit the repair, reaping will take care of all the ondisk
space allocations and incore buffers for the old corrupt directory.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: create an xattr iteration function for scrub
Darrick J. Wong [Mon, 15 Apr 2024 21:54:48 +0000 (14:54 -0700)]
xfs: create an xattr iteration function for scrub

Create a streamlined function to walk a file's xattrs, without all the
cursor management stuff in the regular listxattr.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: ensure unlinked list state is consistent with nlink during scrub
Darrick J. Wong [Mon, 15 Apr 2024 21:54:49 +0000 (14:54 -0700)]
xfs: ensure unlinked list state is consistent with nlink during scrub

Now that we have the means to tell if an inode is on an unlinked inode
list or not, we can check that an inode with zero link count is on the
unlinked list; and an inode that has nonzero link count is not on that
list.  Make repair clean things up too.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: flag empty xattr leaf blocks for optimization
Darrick J. Wong [Mon, 15 Apr 2024 21:54:47 +0000 (14:54 -0700)]
xfs: flag empty xattr leaf blocks for optimization

Empty xattr leaf blocks at offset zero are a waste of space but
otherwise harmless.  If we encounter one, flag it as an opportunity for
optimization.

If we encounter empty attr leaf blocks anywhere else in the attr fork,
that's corruption.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: scrub should set preen if attr leaf has holes
Darrick J. Wong [Mon, 15 Apr 2024 21:54:46 +0000 (14:54 -0700)]
xfs: scrub should set preen if attr leaf has holes

If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: repair extended attributes
Darrick J. Wong [Mon, 15 Apr 2024 21:54:45 +0000 (14:54 -0700)]
xfs: repair extended attributes

If the extended attributes look bad, try to sift through the rubble to
find whatever keys/values we can, stage a new attribute structure in a
temporary file and use the atomic extent swapping mechanism to commit
the results in bulk.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: use atomic extent swapping to fix user file fork data
Darrick J. Wong [Mon, 15 Apr 2024 21:54:44 +0000 (14:54 -0700)]
xfs: use atomic extent swapping to fix user file fork data

Build on the code that was recently added to the temporary repair file
code so that we can atomically switch the contents of any file fork,
even if the fork is in local format.  The upcoming functions to repair
xattrs, directories, and symlinks will need that capability.

Repair can lock out access to these user files by holding IOLOCK_EXCL on
these user files.  Therefore, it is safe to drop the ILOCK of both the
file being repaired and the tempfile being used for staging, and cancel
the scrub transaction.  We do this so that we can reuse the resource
estimation and transaction allocation functions used by a regular file
exchange operation.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: create a blob array data structure
Darrick J. Wong [Mon, 15 Apr 2024 21:54:43 +0000 (14:54 -0700)]
xfs: create a blob array data structure

Create a simple 'blob array' data structure for storage of arbitrarily
sized metadata objects that will be used to reconstruct metadata.  For
the intended usage (temporarily storing extended attribute names and
values) we only have to support storing objects and retrieving them.
Use the xfile abstraction to store the attribute information in memory
that can be swapped out.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: enable discarding of folios backing an xfile
Darrick J. Wong [Mon, 15 Apr 2024 21:54:42 +0000 (14:54 -0700)]
xfs: enable discarding of folios backing an xfile

Create a new xfile function to discard the page cache that's backing
part of an xfile.  The next patch wil use this to drop parts of an xfile
that aren't needed anymore.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: validate explicit directory free block owners
Darrick J. Wong [Mon, 15 Apr 2024 21:54:41 +0000 (14:54 -0700)]
xfs: validate explicit directory free block owners

Port the existing directory freespace block header checking function to
accept an owner number instead of an xfs_inode, then update the
callsites to use xfs_da_args.owner when possible.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: validate explicit directory block buffer owners
Darrick J. Wong [Mon, 15 Apr 2024 21:54:41 +0000 (14:54 -0700)]
xfs: validate explicit directory block buffer owners

Port the existing directory block header checking function to accept an
owner number instead of an xfs_inode, then update the callsites to use
xfs_da_args.owner when possible.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: validate explicit directory data buffer owners
Darrick J. Wong [Mon, 15 Apr 2024 21:54:40 +0000 (14:54 -0700)]
xfs: validate explicit directory data buffer owners

Port the existing directory data header checking function to accept an
owner number instead of an xfs_inode, then update the callsites to use
xfs_da_args.owner when possible.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: validate directory leaf buffer owners
Darrick J. Wong [Mon, 15 Apr 2024 21:54:39 +0000 (14:54 -0700)]
xfs: validate directory leaf buffer owners

Check the owner field of directory leaf blocks.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: validate dabtree node buffer owners
Darrick J. Wong [Mon, 15 Apr 2024 21:54:38 +0000 (14:54 -0700)]
xfs: validate dabtree node buffer owners

Check the owner field of dabtree node blocks.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: validate attr remote value buffer owners
Darrick J. Wong [Mon, 15 Apr 2024 21:54:37 +0000 (14:54 -0700)]
xfs: validate attr remote value buffer owners

Check the owner field of xattr remote value blocks.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: validate attr leaf buffer owners
Darrick J. Wong [Mon, 15 Apr 2024 21:54:36 +0000 (14:54 -0700)]
xfs: validate attr leaf buffer owners

Create a leaf block header checking function to validate the owner field
of xattr leaf blocks.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: reduce indenting in xfs_attr_node_list
Darrick J. Wong [Mon, 15 Apr 2024 21:54:35 +0000 (14:54 -0700)]
xfs: reduce indenting in xfs_attr_node_list

Reduce the indentation here so that we can add some things in the next
patch without going over the column limits.

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: use the xfs_da_args owner field to set new dir/attr block owner
Darrick J. Wong [Mon, 15 Apr 2024 21:54:34 +0000 (14:54 -0700)]
xfs: use the xfs_da_args owner field to set new dir/attr block owner

When we're creating leaf, data, freespace, or dabtree blocks for
directories and xattrs, use the explicit owner field (instead of the
xfs_inode) to set the owner field.  This will enable online repair to
construct replacement data structures in a temporary file without having
to change the owner fields prior to swapping the new and old structures.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: add an explicit owner field to xfs_da_args
Darrick J. Wong [Mon, 15 Apr 2024 21:54:34 +0000 (14:54 -0700)]
xfs: add an explicit owner field to xfs_da_args

Add an explicit owner field to xfs_da_args, which will make it easier
for online fsck to set the owner field of the temporary directory and
xattr structures that it builds to repair damaged metadata.

Note: I hopefully found all the xfs_da_args definitions by looking for
automatic stack variable declarations and xfs_da_args.dp assignments:

git grep -E '(args.*dp =|struct xfs_da_args[[:space:]]*[a-z0-9][a-z0-9]*)'

Note that callers of xfs_attr_{get,set,change} can set the owner to zero
(or leave it unset) to have the default set to args->dp.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: online repair of realtime summaries
Darrick J. Wong [Mon, 15 Apr 2024 21:54:33 +0000 (14:54 -0700)]
xfs: online repair of realtime summaries

Repair the realtime summary data by constructing a new rtsummary file in
the scrub temporary file, then atomically swapping the contents.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: add the ability to reap entire inode forks
Darrick J. Wong [Mon, 15 Apr 2024 21:54:30 +0000 (14:54 -0700)]
xfs: add the ability to reap entire inode forks

In preparation for supporting repair of indexed file-based metadata
(such as realtime bitmaps, directories, and extended attribute data),
add a function to reap the old blocks after a metadata repair finishes.
IOWs, this is an elaborate bunmapi call that deals with crosslinked
blocks by unmapping them without freeing them, and also scans for incore
buffers to invalidate.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: teach the tempfile to set up atomic file content exchanges
Darrick J. Wong [Mon, 15 Apr 2024 21:54:32 +0000 (14:54 -0700)]
xfs: teach the tempfile to set up atomic file content exchanges

Create some new routines to exchange the contents of a temporary file
created to stage a repair with another ondisk file.  This will be used
by the realtime summary repair function to commit atomically the new
rtsummary data, which will be staged in the tempfile.

The rest of XFS coordinates access to the realtime metadata inodes
solely through the ILOCK.  For repair to hold its exclusive access to
the realtime summary file, it has to allocate a single large transaction
and roll it repeatedly throughout the repair while holding the ILOCK.
In turn, this means that for now there's only a partial file mapping
exchange implementation for the temporary file because we can only work
within an existing transaction.

For now, the only tempswap functions needed here are to estimate the
resource requirements of the exchange, reserve more space/quota to an
existing transaction, and kick off the actual exchange.  The rest will
be added in a later patch in preparation for repairing xattrs and
directories.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: support preallocating and copying content into temporary files
Darrick J. Wong [Mon, 15 Apr 2024 21:54:31 +0000 (14:54 -0700)]
xfs: support preallocating and copying content into temporary files

Create the routines we need to preallocate space in a temporary ondisk
file and then copy the contents of an xfile into the tempfile.  The
upcoming rtsummary repair feature will construct the contents of a
realtime summary file in memory, after which it will want to copy all
that into the ondisk temporary file before atomically committing the new
rtsummary contents.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: refactor live buffer invalidation for repairs
Darrick J. Wong [Mon, 15 Apr 2024 21:54:29 +0000 (14:54 -0700)]
xfs: refactor live buffer invalidation for repairs

In an upcoming patch, we will need to be able to look for xfs_buf
objects caching file-based metadata blocks without needing to walk the
(possibly corrupt) structures to find all the buffers.  Repair already
has most of the code needed to scan the buffer cache, so hoist these
utility functions.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: create temporary files and directories for online repair
Darrick J. Wong [Mon, 15 Apr 2024 21:54:28 +0000 (14:54 -0700)]
xfs: create temporary files and directories for online repair

Teach the online repair code how to create temporary files or
directories.  These temporary files can be used to stage reconstructed
information until we're ready to perform an atomic extent swap to commit
the new metadata.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: hide private inodes from bulkstat and handle functions
Darrick J. Wong [Mon, 15 Apr 2024 21:54:27 +0000 (14:54 -0700)]
xfs: hide private inodes from bulkstat and handle functions

We're about to start adding functionality that uses internal inodes that
are private to XFS.  What this means is that userspace should never be
able to access any information about these files, and should not be able
to open these files by handle.

To prevent users from ever finding the file or mis-interactions with the
security apparatus, set S_PRIVATE on the inode.  Don't allow bulkstat,
open-by-handle, or linking of S_PRIVATE files into the directory tree.
This should keep private inodes actually private.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: enable logged file mapping exchange feature
Darrick J. Wong [Mon, 15 Apr 2024 21:54:26 +0000 (14:54 -0700)]
xfs: enable logged file mapping exchange feature

Add the XFS_SB_FEAT_INCOMPAT_EXCHRANGE feature to the set of features
that we will permit when mounting a filesystem.  This turns on support
for the file range exchange feature.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agodocs: update swapext -> exchmaps language
Darrick J. Wong [Mon, 15 Apr 2024 21:54:25 +0000 (14:54 -0700)]
docs: update swapext -> exchmaps language

Start reworking the atomic swapext design documentation to refer to its
new file contents/mapping exchange name.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: capture inode generation numbers in the ondisk exchmaps log item
Darrick J. Wong [Mon, 15 Apr 2024 21:54:24 +0000 (14:54 -0700)]
xfs: capture inode generation numbers in the ondisk exchmaps log item

Per some very late review comments, capture the generation numbers of
both inodes involved in a file content exchange operation so that we
don't accidentally target files with have been reallocated.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: support non-power-of-two rtextsize with exchange-range
Darrick J. Wong [Mon, 15 Apr 2024 21:54:23 +0000 (14:54 -0700)]
xfs: support non-power-of-two rtextsize with exchange-range

The generic exchange-range alignment checks use (fast) bitmasking
operations to perform block alignment checks on the exchange parameters.
Unfortunately, bitmasks require that the alignment size be a power of
two.  This isn't true for realtime devices with a non-power-of-two
extent size, so we have to copy-pasta the generic checks using long
division for this to work properly.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: make file range exchange support realtime files
Darrick J. Wong [Mon, 15 Apr 2024 21:54:22 +0000 (14:54 -0700)]
xfs: make file range exchange support realtime files

Now that bmap items support the realtime device, we can add the
necessary pieces to the file range exchange code to support exchanging
mappings.  All we really need to do here is adjust the blockcount
upwards to the end of the rt extent and remove the inode checks.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: condense symbolic links after a mapping exchange operation
Darrick J. Wong [Mon, 15 Apr 2024 21:54:21 +0000 (14:54 -0700)]
xfs: condense symbolic links after a mapping exchange operation

The previous commit added a new file mapping exchange flag that enables
us to perform post-exchange processing on file2 once we're done
exchanging the extent mappings.  Now add this ability for symlinks.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online symlink repair feature can
salvage the remote target in a temporary link and exchange the data fork
mappings when ready.  If one file is in extents format and the other is
inline, we will have to promote both to extents format to perform the
exchange.  After the exchange, we can try to condense the fixed symlink
down to inline format if possible.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: condense directories after a mapping exchange operation
Darrick J. Wong [Mon, 15 Apr 2024 21:54:20 +0000 (14:54 -0700)]
xfs: condense directories after a mapping exchange operation

The previous commit added a new file mapping exchange flag that enables
us to perform post-swap processing on file2 once we're done exchanging
extent mappings.  Now add this ability for directories.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online directory repair feature can
create salvaged dirents in a temporary directory and exchange the data
fork mappings when ready.  If one file is in extents format and the
other is inline, we will have to promote both to extents format to
perform the exchange.  After the exchange, we can try to condense the
fixed directory down to inline format if possible.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: condense extended attributes after a mapping exchange operation
Darrick J. Wong [Mon, 15 Apr 2024 21:54:20 +0000 (14:54 -0700)]
xfs: condense extended attributes after a mapping exchange operation

Add a new file mapping exchange flag that enables us to perform
post-exchange processing on file2 once we're done exchanging the extent
mappings.  If we were swapping mappings between extended attribute
forks, we want to be able to convert file2's attr fork from block to
inline format.

(This implies that all fork contents are exchanged.)

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online xattr repair feature can create
salvaged attrs in a temporary file and exchange the attr fork mappings
when ready.  If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the exchange.
After the exchange, we can try to condense the fixed file's attr fork
back down to inline format if possible.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: add error injection to test file mapping exchange recovery
Darrick J. Wong [Mon, 15 Apr 2024 21:54:19 +0000 (14:54 -0700)]
xfs: add error injection to test file mapping exchange recovery

Add an errortag so that we can test recovery of exchmaps log items.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: bind together the front and back ends of the file range exchange code
Darrick J. Wong [Mon, 15 Apr 2024 21:54:18 +0000 (14:54 -0700)]
xfs: bind together the front and back ends of the file range exchange code

So far, we've constructed the front end of the file range exchange code
that does all the checking; and the back end of the file mapping
exchange code that actually does the work.  Glue these two pieces
together so that we can turn on the functionality.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: create deferred log items for file mapping exchanges
Darrick J. Wong [Mon, 15 Apr 2024 21:54:17 +0000 (14:54 -0700)]
xfs: create deferred log items for file mapping exchanges

Now that we've created the skeleton of a log intent item to track and
restart file mapping exchange operations, add the upper level logic to
commit intent items and turn them into concrete work recorded in the
log.  This builds on the existing bmap update intent items that have
been around for a while now.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: introduce a file mapping exchange log intent item
Darrick J. Wong [Mon, 15 Apr 2024 21:54:16 +0000 (14:54 -0700)]
xfs: introduce a file mapping exchange log intent item

Introduce a new intent log item to handle exchanging mappings between
the forks of two files.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: create a incompat flag for atomic file mapping exchanges
Darrick J. Wong [Mon, 15 Apr 2024 21:54:15 +0000 (14:54 -0700)]
xfs: create a incompat flag for atomic file mapping exchanges

Create a incompat flag so that we only attempt to process file mapping
exchange log items if the filesystem supports it, and a geometry flag to
advertise support if it's present.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: introduce new file range exchange ioctl
Darrick J. Wong [Mon, 15 Apr 2024 21:54:14 +0000 (14:54 -0700)]
xfs: introduce new file range exchange ioctl

Introduce a new ioctl to handle exchanging ranges of bytes
between files.  The goal here is to perform the exchange atomically with
respect to applications -- either they see the file contents before the
exchange or they see that A-B is now B-A, even if the kernel crashes.

My original goal with all this code was to make it so that online repair
can build a replacement directory or xattr structure in a temporary file
and commit the repair by atomically exchanging all the data blocks
between the two files.  However, I needed a way to test this mechanism
thoroughly, so I've been evolving an ioctl interface since then.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agovfs: export remap and write check helpers
Darrick J. Wong [Mon, 15 Apr 2024 21:54:13 +0000 (14:54 -0700)]
vfs: export remap and write check helpers

Export these functions so that the next patch can use them to check the
file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation.

Cc: [email protected]
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: constify xfs_bmap_is_written_extent
Darrick J. Wong [Mon, 15 Apr 2024 21:54:12 +0000 (14:54 -0700)]
xfs: constify xfs_bmap_is_written_extent

This predicate doesn't modify the structure that's being passed in, so
we can mark it const.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: refactor non-power-of-two alignment checks
Darrick J. Wong [Mon, 15 Apr 2024 21:54:12 +0000 (14:54 -0700)]
xfs: refactor non-power-of-two alignment checks

Create a helper function that can compute if a 64-bit number is an
integer multiple of a 32-bit number, where the 32-bit number is not
required to be an even power of two.  This is needed for some new code
for the realtime device, where we can set 37k allocation units and then
have to remap them.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: hoist multi-fsb allocation unit detection to a helper
Darrick J. Wong [Mon, 15 Apr 2024 21:54:11 +0000 (14:54 -0700)]
xfs: hoist multi-fsb allocation unit detection to a helper

Replace the open-coded logic to decide if a file has a multi-fsb
allocation unit to a helper to make the code easier to read.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: create a new helper to return a file's allocation unit
Darrick J. Wong [Mon, 15 Apr 2024 21:54:10 +0000 (14:54 -0700)]
xfs: create a new helper to return a file's allocation unit

Create a new helper function to calculate the fundamental allocation
unit (i.e. the smallest unit of space we can allocate) of a file.
Things are going to get hairy with range-exchange on the realtime
device, so prepare for this now.

Remove the static attribute from xfs_is_falloc_aligned since the next
patch will need it.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: declare xfs_file.c symbols in xfs_file.h
Darrick J. Wong [Mon, 15 Apr 2024 21:54:09 +0000 (14:54 -0700)]
xfs: declare xfs_file.c symbols in xfs_file.h

Move the two public symbols in xfs_file.c to xfs_file.h.  We're about to
add more public symbols in that source file, so let's finally create the
header file.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: move xfs_iops.c declarations out of xfs_inode.h
Darrick J. Wong [Mon, 15 Apr 2024 21:54:08 +0000 (14:54 -0700)]
xfs: move xfs_iops.c declarations out of xfs_inode.h

Similarly, move declarations of public symbols of xfs_iops.c from
xfs_inode.h to xfs_iops.h.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: move inode lease breaking functions to xfs_inode.c
Darrick J. Wong [Mon, 15 Apr 2024 21:54:07 +0000 (14:54 -0700)]
xfs: move inode lease breaking functions to xfs_inode.c

The lease breaking functions operate at the scope of the entire VFS
inode, not subranges of a file.  Move them to xfs_inode.c since they're
already declared in xfs_inode.h.  This cleanup moves us closer to
having xfs_FOO.h declare only the symbols in xfs_FOO.c.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: only clear log incompat flags at clean unmount
Darrick J. Wong [Mon, 15 Apr 2024 21:54:06 +0000 (14:54 -0700)]
xfs: only clear log incompat flags at clean unmount

While reviewing the online fsck patchset, someone spied the
xfs_swapext_can_use_without_log_assistance function and wondered why we
go through this inverted-bitmask dance to avoid setting the
XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT feature.

(The same principles apply to the logged extended attribute update
feature bit in the since-merged LARP series.)

The reason for this dance is that xfs_add_incompat_log_feature is an
expensive operation -- it forces the log, pushes the AIL, and then if
nobody's beaten us to it, sets the feature bit and issues a synchronous
write of the primary superblock.  That could be a one-time cost
amortized over the life of the filesystem, but the log quiesce and cover
operations call xfs_clear_incompat_log_features to remove feature bits
opportunistically.  On a moderately loaded filesystem this leads to us
cycling those bits on and off over and over, which hurts performance.

Why do we clear the log incompat bits?  Back in ~2020 I think Dave and I
had a conversation on IRC[2] about what the log incompat bits represent.
IIRC in that conversation we decided that the log incompat bits protect
unrecovered log items so that old kernels won't try to recover them and
barf.  Since a clean log has no protected log items, we could clear the
bits at cover/quiesce time.

As Dave Chinner pointed out in the thread, clearing log incompat bits at
unmount time has positive effects for golden root disk image generator
setups, since the generator could be running a newer kernel than what
gets written to the golden image -- if there are log incompat fields set
in the golden image that was generated by a newer kernel/OS image
builder then the provisioning host cannot mount the filesystem even
though the log is clean and recovery is unnecessary to mount the
filesystem.

Given that it's expensive to set log incompat bits, we really only want
to do that once per bit per mount.  Therefore, I propose that we only
clear log incompat bits as part of writing a clean unmount record.  Do
this by adding an operational state flag to the xfs mount that guards
whether or not the feature bit clearing can actually take place.

This eliminates the l_incompat_users rwsem that we use to protect a log
cleaning operation from clearing a feature bit that a frontend thread is
trying to set -- this lock adds another way to fail w.r.t. locking.  For
the swapext series, I shard that into multiple locks just to work around
the lockdep complaints, and that's fugly.

Link: https://lore.kernel.org/linux-xfs/20240131230043.GA6180@frogsfrogsfrogs/
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>
11 months agoxfs: fix error bailout in xrep_abt_build_new_trees
Darrick J. Wong [Mon, 15 Apr 2024 21:54:06 +0000 (14:54 -0700)]
xfs: fix error bailout in xrep_abt_build_new_trees

Dan Carpenter reports:

"Commit 4bdfd7d15747 ("xfs: repair free space btrees") from Dec 15,
2023 (linux-next), leads to the following Smatch static checker
warning:

        fs/xfs/scrub/alloc_repair.c:781 xrep_abt_build_new_trees()
        warn: missing unwind goto?"

That's a bug, so let's fix it.

Reported-by: Dan Carpenter <[email protected]>
Fixes: 4bdfd7d15747 ("xfs: repair free space btrees")
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory
Darrick J. Wong [Mon, 15 Apr 2024 21:54:05 +0000 (14:54 -0700)]
xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory

xfs/399 found the following deadlock when fuzzing core.mode = ones:

/proc/20506/task/20558/stack :
[<0>] xfs_ilock+0xa0/0x240 [xfs]
[<0>] xfs_ilock_data_map_shared+0x1b/0x20 [xfs]
[<0>] xrep_dinode_findmode_walk_directory+0x69/0xe0 [xfs]
[<0>] xrep_dinode_find_mode+0x103/0x2a0 [xfs]
[<0>] xrep_dinode_mode+0x7c/0x120 [xfs]
[<0>] xrep_dinode_core+0xed/0x2b0 [xfs]
[<0>] xrep_dinode_problems+0x10/0x80 [xfs]
[<0>] xrep_inode+0x6c/0xc0 [xfs]
[<0>] xrep_attempt+0x64/0x1d0 [xfs]
[<0>] xfs_scrub_metadata+0x365/0x840 [xfs]
[<0>] xfs_scrubv_metadata+0x282/0x430 [xfs]
[<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs]
[<0>] xfs_file_ioctl+0xc68/0x1780 [xfs]
/proc/20506/task/20559/stack :
[<0>] xfs_buf_lock+0x3b/0x110 [xfs]
[<0>] xfs_buf_find_lock+0x66/0x1c0 [xfs]
[<0>] xfs_buf_get_map+0x208/0xc00 [xfs]
[<0>] xfs_buf_read_map+0x5d/0x2c0 [xfs]
[<0>] xfs_trans_read_buf_map+0x1b0/0x4c0 [xfs]
[<0>] xfs_read_agi+0xbd/0x190 [xfs]
[<0>] xfs_ialloc_read_agi+0x47/0x160 [xfs]
[<0>] xfs_imap_lookup+0x69/0x1f0 [xfs]
[<0>] xfs_imap+0x1fc/0x3d0 [xfs]
[<0>] xfs_iget+0x357/0xd50 [xfs]
[<0>] xchk_dir_actor+0x16e/0x330 [xfs]
[<0>] xchk_dir_walk_block+0x164/0x1e0 [xfs]
[<0>] xchk_dir_walk+0x13a/0x190 [xfs]
[<0>] xchk_directory+0x1a2/0x2b0 [xfs]
[<0>] xfs_scrub_metadata+0x2f4/0x840 [xfs]
[<0>] xfs_scrubv_metadata+0x282/0x430 [xfs]
[<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs]
[<0>] xfs_file_ioctl+0xc68/0x1780 [xfs]

Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the
root directory.  Thread 20559 holds the root directory ILOCK and is
trying to grab the AGI of an inode that is one of the root directory's
children.  The AGI held by 20558 is the same buffer that 20559 is trying
to acquire.  In other words, this is an ABBA deadlock.

In general, the lock order is ILOCK and then AGI -- rename does this
while preparing for an operation involving whiteouts or renaming files
out of existence; and unlink does this when moving an inode to the
unlinked list.  The only place where we do it in the opposite order is
on the child during an icreate, but at that point the child is marked
INEW and is not visible to other threads.

Work around this deadlock by replacing the blocking ilock attempt with a
nonblocking loop that aborts after 30 seconds.  Relax for a jiffy after
a failed lock attempt.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode
Darrick J. Wong [Mon, 15 Apr 2024 21:54:04 +0000 (14:54 -0700)]
xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode

While reviewing the next patch which fixes an ABBA deadlock between the
AGI and a directory ILOCK, someone asked a question about why we're
holding the AGI in the first place.  The reason for that is to quiesce
the inode structures for that AG while we do a repair.

I then realized that the xrep_dinode_findmode invokes xchk_iscan_iter,
which walks the inobts (and hence the AGIs) to find all the inodes.
This itself is also an ABBA vector, since the damaged inode could be in
AG 5, which we hold while we scan AG 0 for directories.  5 -> 0 is not
allowed.

To address this, modify the iscan to allow trylock of the AGI buffer
using the flags argument to xfs_ialloc_read_agi that the previous patch
added.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoxfs: pass xfs_buf lookup flags to xfs_*read_agi
Darrick J. Wong [Mon, 15 Apr 2024 21:54:03 +0000 (14:54 -0700)]
xfs: pass xfs_buf lookup flags to xfs_*read_agi

Allow callers to pass buffer lookup flags to xfs_read_agi and
xfs_ialloc_read_agi.  This will be used in the next patch to fix a
deadlock in the online fsck inode scanner.

Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
11 months agoLinux 6.9-rc4 v6.9-rc4
Linus Torvalds [Sun, 14 Apr 2024 20:38:39 +0000 (13:38 -0700)]
Linux 6.9-rc4

11 months agoMerge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Apr 2024 18:41:51 +0000 (11:41 -0700)]
Merge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull sysfs fix from Al Viro:
 "Get rid of lockdep false positives around sysfs/overlayfs

  syzbot has uncovered a class of lockdep false positives for setups
  with sysfs being one of the backing layers in overlayfs. The root
  cause is that of->mutex allocated when opening a sysfs file read-only
  (which overlayfs might do) is confused with of->mutex of a file opened
  writable (held in write to sysfs file, which overlayfs won't do).

  Assigning them separate lockdep classes fixes that bunch and it's
  obviously safe"

* tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kernfs: annotate different lockdep class for of->mutex of writable files

11 months agoMerge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 14 Apr 2024 17:48:51 +0000 (10:48 -0700)]
Merge tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Follow up fixes for the BHI mitigations code

 - Fix !SPECULATION_MITIGATIONS bug not turning off mitigations as
   expected

 - Work around an APIC emulation bug when the kernel is built with Clang
   and run as a SEV guest

 - Follow up x86 topology fixes

* tag 'x86-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu/amd: Move TOPOEXT enablement into the topology parser
  x86/cpu/amd: Make the NODEID_MSR union actually work
  x86/cpu/amd: Make the CPUID 0x80000008 parser correct
  x86/bugs: Replace CONFIG_SPECTRE_BHI_{ON,OFF} with CONFIG_MITIGATION_SPECTRE_BHI
  x86/bugs: Remove CONFIG_BHI_MITIGATION_AUTO and spectre_bhi=auto
  x86/bugs: Clarify that syscall hardening isn't a BHI mitigation
  x86/bugs: Fix BHI handling of RRSBA
  x86/bugs: Rename various 'ia32_cap' variables to 'x86_arch_cap_msr'
  x86/bugs: Cache the value of MSR_IA32_ARCH_CAPABILITIES
  x86/bugs: Fix BHI documentation
  x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n
  x86/topology: Don't update cpu_possible_map in topo_set_cpuids()
  x86/bugs: Fix return type of spectre_bhi_state()
  x86/apic: Force native_apic_mem_read() to use the MOV instruction

11 months agoMerge tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Apr 2024 17:32:22 +0000 (10:32 -0700)]
Merge tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:

 - Address a (valid) W=1 build warning

 - Fix timer self-tests

 - Annotate a KCSAN warning wrt. accesses to the tick_do_timer_cpu
   global variable

 - Address a !CONFIG_BUG build warning

* tag 'timers-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests: kselftest: Fix build failure with NOLIBC
  selftests: timers: Fix abs() warning in posix_timers test
  selftests: kselftest: Mark functions that unconditionally call exit() as __noreturn
  selftests: timers: Fix posix_timers ksft_print_msg() warning
  selftests: timers: Fix valid-adjtimex signed left-shift undefined behavior
  bug: Fix no-return-statement warning with !CONFIG_BUG
  timekeeping: Use READ/WRITE_ONCE() for tick_do_timer_cpu
  selftests/timers/posix_timers: Reimplement check_timer_distribution()
  irqflags: Explicitly ignore lockdep_hrtimer_exit() argument

11 months agoMerge tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 14 Apr 2024 17:26:27 +0000 (10:26 -0700)]
Merge tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf event fix from Ingo Molnar:
 "Fix the x86 PMU multi-counter code returning invalid data in certain
  circumstances"

* tag 'perf-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix out of range data

11 months agoMerge tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 14 Apr 2024 17:13:56 +0000 (10:13 -0700)]
Merge tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Ingo Molnar:
 "Fix a PREEMPT_RT build bug"

* tag 'locking-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking: Make rwsem_assert_held_write_nolockdep() build with PREEMPT_RT=y

11 months agoMerge tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 14 Apr 2024 17:12:34 +0000 (10:12 -0700)]
Merge tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
 "Fix a bug in the GIC irqchip driver"

* tag 'irq-urgent-2024-04-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v3-its: Fix VSYNC referencing an unmapped VPE on GIC v4.1

11 months agoMerge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Linus Torvalds [Sun, 14 Apr 2024 17:05:59 +0000 (10:05 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio bugfixes from Michael Tsirkin:
 "Some small, obvious (in hindsight) bugfixes:

   - new ioctl in vhost-vdpa has a wrong # - not too late to fix

   - vhost has apparently been lacking an smp_rmb() - due to code
     duplication :( The duplication will be fixed in the next merge
     cycle, this is a minimal fix

   - an error message in vhost talks about guest moving used index -
     which of course never happens, guest only ever moves the available
     index

   - i2c-virtio didn't set the driver owner so it did not get refcounted
     correctly"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vhost: correct misleading printing information
  vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE
  virtio: store owner from modules with register_virtio_driver()
  vhost: Add smp_rmb() in vhost_enable_notify()
  vhost: Add smp_rmb() in vhost_vq_avail_empty()

11 months agoMerge tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping
Linus Torvalds [Sun, 14 Apr 2024 17:02:40 +0000 (10:02 -0700)]
Merge tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:

 - fix up swiotlb buffer padding even more (Petr Tesarik)

 - fix for partial dma_sync on swiotlb (Michael Kelley)

 - swiotlb debugfs fix (Dexuan Cui)

* tag 'dma-maping-6.9-2024-04-14' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: do not set total_used to 0 in swiotlb_create_debugfs_files()
  swiotlb: fix swiotlb_bounce() to do partial sync's correctly
  swiotlb: extend buffer pre-padding to alloc_align_mask if necessary

11 months agokernfs: annotate different lockdep class for of->mutex of writable files
Amir Goldstein [Fri, 5 Apr 2024 14:56:35 +0000 (17:56 +0300)]
kernfs: annotate different lockdep class for of->mutex of writable files

The writable file /sys/power/resume may call vfs lookup helpers for
arbitrary paths and readonly files can be read by overlayfs from vfs
helpers when sysfs is a lower layer of overalyfs.

To avoid a lockdep warning of circular dependency between overlayfs
inode lock and kernfs of->mutex, use a different lockdep class for
writable and readonly kernfs files.

Reported-by: [email protected]
Fixes: 0fedefd4c4e3 ("kernfs: sysfs: support custom llseek method for sysfs entries")
Suggested-by: Al Viro <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Al Viro <[email protected]>
11 months agoMerge tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux
Linus Torvalds [Sat, 13 Apr 2024 17:27:58 +0000 (10:27 -0700)]
Merge tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

Pull ata fixes from Damien Le Moal:

 - Add the mask_port_map parameter to the ahci driver. This is a
   follow-up to the recent snafu with the ASMedia controller and its
   virtual port hidding port-multiplier devices. As ASMedia confirmed
   that there is no way to determine if these slow-to-probe virtual
   ports are actually representing the ports of a port-multiplier
   devices, this new parameter allow masking ports to significantly
   speed up probing during system boot, resulting in shorter boot times.

 - A fix for an incorrect handling of a port unlock in
   ata_scsi_dev_rescan().

 - Allow command duration limits to be detected for ACS-4 devices are
   there are such devices out in the field.

* tag 'ata-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
  ata: libata-core: Allow command duration limits detection for ACS-4 drives
  ata: libata-scsi: Fix ata_scsi_dev_rescan() error path
  ata: ahci: Add mask_port_map module parameter

11 months agoMerge tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal...
Linus Torvalds [Sat, 13 Apr 2024 17:25:32 +0000 (10:25 -0700)]
Merge tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fix from Damien Le Moal:

 - Suppress a coccicheck warning using str_plural()

* tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: Use str_plural() to fix Coccinelle warning

11 months agoMerge tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sat, 13 Apr 2024 17:10:18 +0000 (10:10 -0700)]
Merge tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - fix for oops in cifs_get_fattr of deleted files

 - fix for the remote open counter going negative in some directory
   lease cases

 - fix for mkfifo to instantiate dentry to avoid possible crash

 - important fix to allow handling key rotation for mount and remount
   (ie cases that are becoming more common when password that was used
   for the mount will expire soon but will be replaced by new password)

* tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix broken reconnect when password changing on the server by allowing password rotation
  smb: client: instantiate when creating SFU files
  smb3: fix Open files on server counter going negative
  smb: client: fix NULL ptr deref in cifs_mark_open_handles_for_deleted_file()

11 months agoata: libata-core: Allow command duration limits detection for ACS-4 drives
Igor Pylypiv [Thu, 11 Apr 2024 20:12:24 +0000 (20:12 +0000)]
ata: libata-core: Allow command duration limits detection for ACS-4 drives

Even though the command duration limits (CDL) feature was first added
in ACS-5 (major version 12), there are some ACS-4 (major version 11)
drives that implement CDL as well.

IDENTIFY_DEVICE, SUPPORTED_CAPABILITIES, and CURRENT_SETTINGS log pages
are mandatory in the ACS-4 standard so it should be safe to read these
log pages on older drives implementing the ACS-4 standard.

Fixes: 62e4a60e0cdb ("scsi: ata: libata: Detect support for command duration limits")
Cc: [email protected]
Signed-off-by: Igor Pylypiv <[email protected]>
Signed-off-by: Damien Le Moal <[email protected]>
11 months agoata: libata-scsi: Fix ata_scsi_dev_rescan() error path
Damien Le Moal [Thu, 11 Apr 2024 23:41:15 +0000 (08:41 +0900)]
ata: libata-scsi: Fix ata_scsi_dev_rescan() error path

Commit 0c76106cb975 ("scsi: sd: Fix TCG OPAL unlock on system resume")
incorrectly handles failures of scsi_resume_device() in
ata_scsi_dev_rescan(), leading to a double call to
spin_unlock_irqrestore() to unlock a device port. Fix this by redefining
the goto labels used in case of errors and only unlock the port
scsi_scan_mutex when scsi_resume_device() fails.

Bug found with the Smatch static checker warning:

drivers/ata/libata-scsi.c:4774 ata_scsi_dev_rescan()
error: double unlocked 'ap->lock' (orig line 4757)

Reported-by: Dan Carpenter <[email protected]>
Fixes: 0c76106cb975 ("scsi: sd: Fix TCG OPAL unlock on system resume")
Cc: [email protected]
Signed-off-by: Damien Le Moal <[email protected]>
Reviewed-by: Niklas Cassel <[email protected]>
11 months agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 12 Apr 2024 20:08:39 +0000 (13:08 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
 "Fix the TLBI RANGE operand calculation causing live migration under
  KVM/arm64 to miss dirty pages due to stale TLB entries"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: tlb: Fix TLBI RANGE operand

11 months agoMerge tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Linus Torvalds [Fri, 12 Apr 2024 20:02:27 +0000 (13:02 -0700)]
Merge tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc

Pull SoC fixes from Arnd Bergmann:
 "The device tree changes this time are all for NXP i.MX platforms,
  addressing issues with clocks and regulators on i.MX7 and i.MX8.

  The old OMAP2 based Nokia N8x0 tablet get a couple of code fixes for
  regressions that came in.

  The ARM SCMI and FF-A firmware interfaces get a couple of minor bug
  fixes.

  A regression fix for RISC-V cache management addresses a problem with
  probe order on Sifive cores"

* tag 'soc-fixes-6.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (23 commits)
  MAINTAINERS: Change Krzysztof Kozlowski's email address
  arm64: dts: imx8qm-ss-dma: fix can lpcg indices
  arm64: dts: imx8-ss-dma: fix can lpcg indices
  arm64: dts: imx8-ss-dma: fix adc lpcg indices
  arm64: dts: imx8-ss-dma: fix pwm lpcg indices
  arm64: dts: imx8-ss-dma: fix spi lpcg indices
  arm64: dts: imx8-ss-conn: fix usb lpcg indices
  arm64: dts: imx8-ss-lsio: fix pwm lpcg indices
  ARM: dts: imx7s-warp: Pass OV2680 link-frequencies
  ARM: dts: imx7-mba7: Use 'no-mmc' property
  arm64: dts: imx8-ss-conn: fix usdhc wrong lpcg clock order
  arm64: dts: freescale: imx8mp-venice-gw73xx-2x: fix USB vbus regulator
  arm64: dts: freescale: imx8mp-venice-gw72xx-2x: fix USB vbus regulator
  cache: sifive_ccache: Partially convert to a platform driver
  firmware: arm_scmi: Make raw debugfs entries non-seekable
  firmware: arm_scmi: Fix wrong fastchannel initialization
  firmware: arm_ffa: Fix the partition ID check in ffa_notification_info_get()
  ARM: OMAP2+: fix USB regression on Nokia N8x0
  mmc: omap: restore original power up/down steps
  mmc: omap: fix deferred probe
  ...

11 months agoMerge tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 12 Apr 2024 19:56:19 +0000 (12:56 -0700)]
Merge tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu

Pull iommu fixes from Joerg Roedel:

 - Intel VT-d Fixes:
     - Allocate local memory for PRQ page
     - Fix WARN_ON in iommu probe path
     - Fix wrong use of pasid config

 - AMD IOMMU Fixes:
     - Lock inversion fix
     - Log message severity fix
     - Disable SNP when v2 page-tables are used

 - Mediatek driver:
     - Fix module autoloading

* tag 'iommu-fixes-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/amd: Change log message severity
  iommu/vt-d: Fix WARN_ON in iommu probe path
  iommu/vt-d: Allocate local memory for page request queue
  iommu/vt-d: Fix wrong use of pasid config
  iommu: mtk: fix module autoloading
  iommu/amd: Do not enable SNP when V2 page table is enabled
  iommu/amd: Fix possible irq lock inversion dependency issue

11 months agoMerge tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Linus Torvalds [Fri, 12 Apr 2024 19:47:48 +0000 (12:47 -0700)]
Merge tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull pci fixes from Bjorn Helgaas:

 - Revert a quirk that prevented Secondary Bus Reset for LSI / Agere
   FW643.

   We thought the device was broken, but the reset does work correctly
   on other platforms, and the reset avoids leaking data out of VMs
   (Bjorn Helgaas)

 - Update MAINTAINERS to reflect that Gustavo Pimentel is no longer
   reachable (Manivannan Sadhasivam)

* tag 'pci-v6.9-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  Revert "PCI: Mark LSI FW643 to avoid bus reset"
  MAINTAINERS: Drop Gustavo Pimentel as PCI DWC Maintainer

11 months agoMerge tag 'block-6.9-20240412' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 12 Apr 2024 17:22:33 +0000 (10:22 -0700)]
Merge tag 'block-6.9-20240412' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - MD pull request via Song:
       - UAF fix (Yu)

 - Avoid out-of-bounds shift in blk-iocost (Rik)

 - Fix for q->blkg_list corruption (Ming)

 - Relax virt boundary mask/size segment checking (Ming)

* tag 'block-6.9-20240412' of git://git.kernel.dk/linux:
  block: fix that blk_time_get_ns() doesn't update time after schedule
  block: allow device to have both virt_boundary_mask and max segment size
  block: fix q->blkg_list corruption during disk rebind
  blk-iocost: avoid out of bounds shift
  raid1: fix use-after-free for original bio in raid1_write_request()

11 months agoMerge tag 'io_uring-6.9-20240412' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 12 Apr 2024 17:19:36 +0000 (10:19 -0700)]
Merge tag 'io_uring-6.9-20240412' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Fix for sigmask restoring while waiting for events (Alexey)

 - Typo fix in comment (Haiyue)

 - Fix for a msg_control retstore on SEND_ZC retries (Pavel)

* tag 'io_uring-6.9-20240412' of git://git.kernel.dk/linux:
  io-uring: correct typo in comment for IOU_F_TWQ_LAZY_WAKE
  io_uring/net: restore msg_control on sendzc retry
  io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failure

11 months agoMerge tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client
Linus Torvalds [Fri, 12 Apr 2024 17:15:46 +0000 (10:15 -0700)]
Merge tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two CephFS fixes marked for stable and a MAINTAINERS update"

* tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client:
  MAINTAINERS: remove myself as a Reviewer for Ceph
  ceph: switch to use cap_delay_lock for the unlink delay list
  ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE

11 months agoKconfig: add some hidden tabs on purpose
Linus Torvalds [Fri, 12 Apr 2024 17:05:10 +0000 (10:05 -0700)]
Kconfig: add some hidden tabs on purpose

Commit d96c36004e31 ("tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig
entry") removed a hidden tab because it apparently showed breakage in
some third-party kernel config parsing tool.

It wasn't clear what tool it was, but let's make sure it gets fixed.
Because if you can't parse tabs as whitespace, you should not be parsing
the kernel Kconfig files.

In fact, let's make such breakage more obvious than some esoteric ftrace
record size option.  If you can't parse tabs, you can't have page sizes.

Yes, tab-vs-space confusion is sadly a traditional Unix thing, and
'make' is famous for being broken in this regard.  But no, that does not
mean that it's ok.

I'd add more random tabs to our Kconfig files, but I don't want to make
things uglier than necessary.  But it *might* bbe necessary if it turns
out we see more of this kind of silly tooling.

Fixes: d96c36004e31 ("tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry")
Link: https://lore.kernel.org/lkml/CAHk-=wj-hLLN_t_m5OL4dXLaxvXKy_axuoJYXif7iczbfgAevQ@mail.gmail.com/
Signed-off-by: Linus Torvalds <[email protected]>
11 months agoMerge tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace...
Linus Torvalds [Fri, 12 Apr 2024 16:02:24 +0000 (09:02 -0700)]
Merge tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix the buffer_percent accounting as it is dependent on three
   variables:

     1) pages_read - number of subbuffers read
     2) pages_lost - number of subbuffers lost due to overwrite
     3) pages_touched - number of pages that a writer entered

   These three counters only increment, and to know how many active
   pages there are on the buffer at any given time, the pages_read and
   pages_lost are subtracted from pages_touched.

   But the pages touched was incremented whenever any writer went to the
   next subbuffer even if it wasn't the only one, so it was incremented
   more than it should be causing the counter for how many subbuffers
   currently have content incorrect, which caused the buffer_percent
   that holds waiters until the ring buffer is filled to a given
   percentage to wake up early.

 - Fix warning of unused functions when PERF_EVENTS is not configured in

 - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE

 - Fix to some kerneldoc function comments in eventfs code.

* tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Only update pages_touched when a new page is touched
  tracing: hide unused ftrace_event_id_fops
  tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry
  eventfs: Fix kernel-doc comments to functions

11 months agoMerge tag 'mips-fixes_6.9_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips...
Linus Torvalds [Fri, 12 Apr 2024 15:46:58 +0000 (08:46 -0700)]
Merge tag 'mips-fixes_6.9_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux

Pull MIPS fix from Thomas Bogendoerfer:
 "Fix for syscall_get_nr() to make it work even if tracing is disabled"

* tag 'mips-fixes_6.9_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: scall: Save thread_info.syscall unconditionally on entry

This page took 0.136858 seconds and 4 git commands to generate.