This document gives an overview of the LOCALIO auxiliary RPC protocol
added to the Linux NFS client and server to allow them to reliably
handshake to determine if they are on the same host.
Once an NFS client and server handshake as "local", the client will
bypass the network RPC protocol for read, write and commit operations.
Due to this XDR and RPC bypass, these operations will operate faster.
Mike Snitzer [Thu, 5 Sep 2024 19:09:57 +0000 (15:09 -0400)]
nfs: implement client support for NFS_LOCALIO_PROGRAM
The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common for subsequent lookup and verification
by the NFS server. If matched, the NFS server populates members in the
nfs_uuid_t struct. The NFS client then transfers these nfs_uuid_t
struct member pointers to the nfs_client struct and cleans up the
nfs_uuid_t struct. See: fs/nfs/localio.c:nfs_local_probe()
This protocol isn't part of an IETF standard, nor does it need to be
considering it is Linux-to-Linux auxiliary RPC protocol that amounts
to an implementation detail.
Localio is only supported when UNIX-style authentication (AUTH_UNIX, aka
AUTH_SYS) is used (enforced by fs/nfs/localio.c:nfs_local_probe()).
The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
XDR methods are used instead of the less efficient variable sized
methods.
Having a nonce (single-use uuid) is better than using the same uuid
for the life of the server, and sending it proactively by client
rather than reactively by the server is also safer.
nfs/localio: use dedicated workqueues for filesystem read and write
For localio access, don't call filesystem read() and write() routines
directly. This solves two problems:
1) localio writes need to use a normal (non-memreclaim) unbound
workqueue. This avoids imposing new requirements on how underlying
filesystems process frontend IO, which would cause a large amount
of work to update all filesystems. Without this change, when XFS
starts getting low on space, XFS flushes work on a non-memreclaim
work queue, which causes a priority inversion problem:
2) Some filesystem writeback routines can end up taking up a lot of
stack space (particularly XFS). Instead of risking running over
due to the extra overhead from the NFS stack, we should just call
these routines from a workqueue job. Since we need to do this to
address 1) above we're able to avoid possibly blowing the stack
"for free".
Use of dedicated workqueues improves performance over using the
system_unbound_wq.
Also, the creds used to open the file are used to override_creds() in
both nfs_local_call_read() and nfs_local_call_write() -- otherwise the
workqueue could have elevated capabilities (which the caller may not).
Lastly, care is taken to set PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO in
nfs_do_local_write() to avoid writeback deadlocks.
The PF_LOCAL_THROTTLE flag prevents deadlocks in balance_dirty_pages()
by causing writes to only be throttled against other writes to the
same bdi (it keeps the throttling local). Normally all writes to
bdi(s) are throttled equally (after throughput factors are allowed
for).
The PF_MEMALLOC_NOIO flag prevents the lower filesystem IO from
causing memory reclaim to re-enter filesystems or IO devices and so
prevents deadlocks from occuring where IO that cleans pages is
waiting on IO to complete.
Add client support for bypassing NFS for localhost reads, writes, and
commits. This is only useful when the client and the server are
running on the same host.
nfs_local_probe() is stubbed out, later commits will enable client and
server handshake via a Linux-only LOCALIO auxiliary RPC protocol.
This has dynamic binding with the nfsd module (via nfs_localio module
which is part of nfs_common). LOCALIO will only work if nfsd is
already loaded.
The "localio_enabled" nfs kernel module parameter can be used to
disable and enable the ability to use LOCALIO support.
CONFIG_NFS_LOCALIO enables NFS client support for LOCALIO.
Lastly, LOCALIO uses an nfsd_file to initiate all IO. To make proper
use of nfsd_file (and nfsd's filecache) its lifetime (duration before
nfsd_file_put is called) must extend until after commit, read and
write operations. So rather than immediately drop the nfsd_file
reference in nfs_local_open_fh(), that doesn't happen until
nfs_local_pgio_release() for read/write and not until
nfs_local_release_commit_data() for commit. The same applies to the
reference held on nfsd's nn->nfsd_serv. Both objects' lifetimes and
associated references are managed through calls to
nfs_to->nfsd_file_put_local().
Mike Snitzer [Thu, 5 Sep 2024 19:09:51 +0000 (15:09 -0400)]
nfsd: implement server support for NFS_LOCALIO_PROGRAM
The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common. The server expects this protocol to use
the same transport as NFS and NFSACL for its RPCs. This protocol
isn't part of an IETF standard, nor does it need to be considering it
is Linux-to-Linux auxiliary RPC protocol that amounts to an
implementation detail.
The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
XDR methods are used instead of the less efficient variable sized
methods.
The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
Linux Kernel Organization 400122 nfslocalio
Add server support for bypassing NFS for localhost reads, writes, and
commits. This is only useful when both the client and server are
running on the same host.
If nfsd_open_local_fh() fails then the NFS client will both retry and
fallback to normal network-based read, write and commit operations if
localio is no longer supported.
Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether localio or regular NFS
access is used. The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for localio. Store
auth_domain for localio in nfsd_uuid_t and transfer it to the client
if it is local to the server.
Relative to containers, localio gives the client access to the network
namespace the server has. This is required to allow the client to
access the server's per-namespace nfsd_net struct.
This commit also introduces the use of NFSD's percpu_ref to interlock
nfsd_destroy_serv and nfsd_open_local_fh, to ensure nn->nfsd_serv is
not destroyed while in use by nfsd_open_local_fh and other LOCALIO
client code.
CONFIG_NFS_LOCALIO enables NFS server support for LOCALIO.
Mike Snitzer [Thu, 5 Sep 2024 19:09:49 +0000 (15:09 -0400)]
nfs_common: prepare for the NFS client to use nfsd_file for LOCALIO
The next commit will introduce nfsd_open_local_fh() which returns an
nfsd_file structure. This commit exposes LOCALIO's required NFSD
symbols to the NFS client:
- Make nfsd_open_local_fh() symbol and other required NFSD symbols
available to NFS in a global 'nfs_to' nfsd_localio_operations
struct (global access suggested by Trond, nfsd_localio_operations
suggested by NeilBrown). The next commit will also introduce
nfsd_localio_ops_init() that init_nfsd() will call to initialize
'nfs_to'.
- Introduce nfsd_file_file() that provides access to nfsd_file's
backing file. Keeps nfsd_file structure opaque to NFS client (as
suggested by Jeff Layton).
- Introduce nfsd_file_put_local() that will put the reference to the
nfsd_file's associated nn->nfsd_serv and then put the reference to
the nfsd_file (as suggested by NeilBrown).
fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS
client to generate a nonce (single-use UUID) and associated nfs_uuid_t
struct, register it with nfs_common for subsequent lookup and
verification by the NFS server and if matched the NFS server populates
members in the nfs_uuid_t struct.
nfs_common's nfs_uuids list is the basis for localio enablement, as
such it has members that point to nfsd memory for direct use by the
client (e.g. 'net' is the server's network namespace, through it the
client can access nn->nfsd_serv).
This commit also provides the base nfs_uuid_t interfaces to allow
proper net namespace refcounting for the LOCALIO use case.
CONFIG_NFS_LOCALIO controls the nfs_common, NFS server and NFS client
enablement for LOCALIO. If both NFS_FS=m and NFSD=m then
NFS_COMMON_LOCALIO_SUPPORT=m and nfs_localio.ko is built (and provides
nfs_common's LOCALIO support).
Add new funtion svcauth_map_clnt_to_svc_cred_local which maps a
generic cred to a svc_cred suitable for use in nfsd.
This is needed by the localio code to map nfs client creds to nfs
server credentials.
Following from net/sunrpc/auth_unix.c:unx_marshal() it is clear that
->fsuid and ->fsgid must be used (rather than ->uid and ->gid). In
addition, these uid and gid must be translated with from_kuid_munged()
so local client uses correct uid and gid when acting as local server.
Jeff Layton noted:
This is where the magic happens. Since we're working in
kuid_t/kgid_t, we don't need to worry about further idmapping.
There is really no value in triggering a BUG_ON in response to either
of these previously unsupported conditions.
NeilBrown would like the entire 'if (proc->p_proc != 0)' branch
removed (not just the one BUG_ON that must be removed for LOCALIO's
immediate needs of returning void).
Mike Snitzer [Thu, 5 Sep 2024 19:09:44 +0000 (15:09 -0400)]
nfsd: add nfsd_serv_try_get and nfsd_serv_put
Introduce nfsd_serv_try_get and nfsd_serv_put and update the nfsd code
to prevent nfsd_destroy_serv from destroying nn->nfsd_serv until any
caller of nfsd_serv_try_get releases their reference using nfsd_serv_put.
A percpu_ref is used to implement the interlock between
nfsd_destroy_serv and any caller of nfsd_serv_try_get.
This interlock is needed to properly wait for the completion of client
initiated localio calls to nfsd (that are _not_ in the context of nfsd).
nfsd_file_acquire_local() can be used to look up a file by filehandle
without having a struct svc_rqst. This can be used by NFS LOCALIO to
allow the NFS client to bypass the NFS protocol to directly access a
file provided by the NFS server which is running in the same kernel.
In nfsd_file_do_acquire() care is taken to always use fh_verify() if
rqstp is not NULL (as is the case for non-LOCALIO callers). Otherwise
the non-LOCALIO callers will not supply the correct and required
arguments to __fh_verify (e.g. gssclient isn't passed).
Introduce fh_verify_local() wrapper around __fh_verify to make it
clear that LOCALIO is intended caller.
Also, use GC for nfsd_file returned by nfsd_file_acquire_local. GC
offers performance improvements if/when a file is reopened before
launderette cleans it from the filecache's LRU.
nfsd: factor out __fh_verify to allow NULL rqstp to be passed
__fh_verify() offers an interface like fh_verify() but doesn't require
a struct svc_rqst *, instead it also takes the specific parts as
explicit required arguments. So it is safe to call __fh_verify() with
a NULL rqstp, but the net, cred, and client args must not be NULL.
__fh_verify() does not use SVC_NET(), nor does the functions it calls.
Rather than using rqstp->rq_client pass the client and gssclient
explicitly to __fh_verify and then to nfsd_set_fh_dentry().
Lastly, it should be noted that the previous commit prepared for 4
associated tracepoints to only be used if rqstp is not NULL (this is a
stop-gap that should be properly fixed so localio also benefits from
the utility these tracepoints provide when debugging fh_verify
issues).
Chuck Lever [Thu, 5 Sep 2024 19:09:41 +0000 (15:09 -0400)]
NFSD: Short-circuit fh_verify tracepoints for LOCALIO
LOCALIO will be able to call fh_verify() with a NULL rqstp. In this
case, the existing trace points need to be skipped because they
want to dereference the address fields in the passed-in rqstp.
Temporarily make these trace points conditional to avoid a seg
fault in this case. Putting the "rqstp != NULL" check in the trace
points themselves makes the check more efficient.
Chuck Lever [Thu, 5 Sep 2024 19:09:40 +0000 (15:09 -0400)]
NFSD: Avoid using rqstp->rq_vers in nfsd_set_fh_dentry()
Currently, fh_verify() makes some daring assumptions about which
version of file handle the caller wants, based on the things it can
find in the passed-in rqstp. The about-to-be-introduced LOCALIO use
case sometimes has no svc_rqst context, so this logic won't work in
that case.
Instead, examine the passed-in file handle. It's .max_size field
should carry information to allow nfsd_set_fh_dentry() to initialize
the file handle appropriately.
The file handle used by lockd and the one created by write_filehandle
never need any of the version-specific fields (which affect things
like write and getattr requests and pre/post attributes).
There are several places where __fh_verify unconditionally dereferences
rqstp to check that the connection is suitably secure. They look at
rqstp->rq_xprt which is not meaningful in the target use case of
"localio" NFS in which the client talks directly to the local server.
Prepare these to always succeed when rqstp is NULL.
Dan Aloni [Wed, 24 Jul 2024 11:07:12 +0000 (14:07 +0300)]
nfs: add 'noalignwrite' option for lock-less 'lost writes' prevention
There are some applications that write to predefined non-overlapping
file offsets from multiple clients and therefore don't need to rely on
file locking. However, if these applications want non-aligned offsets
and sizes they need to either use locks or risk data corruption, as the
NFS client defaults to extending writes to whole pages.
This commit adds a new mount option `noalignwrite`, which allows to turn
that off and avoid the need of locking, as long as these applications
don't overlap on offsets.
Li Lingfeng [Sat, 24 Aug 2024 01:43:35 +0000 (09:43 +0800)]
nfs: fix the comment of nfs_get_root
The comment for nfs_get_root() needs to be updated as it would also be
used by NFS4 as follows:
@x[
nfs_get_root+1
nfs_get_tree_common+1819
nfs_get_tree+2594
vfs_get_tree+73
fc_mount+23
do_nfs4_mount+498
nfs4_try_get_tree+134
nfs_get_tree+2562
vfs_get_tree+73
path_mount+2776
do_mount+226
__se_sys_mount+343
__x64_sys_mount+106
do_syscall_64+69
entry_SYSCALL_64_after_hwframe+97
, mount.nfs4]: 1
Roi Azarzar [Sun, 15 Sep 2024 10:27:35 +0000 (10:27 +0000)]
NFSv4.2: Fix detection of "Proxying of Times" server support
According to draft-ietf-nfsv4-delstid-07:
If a server informs the client via the fattr4_open_arguments
attribute that it supports
OPEN_ARGS_SHARE_ACCESS_WANT_DELEG_TIMESTAMPS and it returns a valid
delegation stateid for an OPEN operation which sets the
OPEN4_SHARE_ACCESS_WANT_DELEG_TIMESTAMPS flag, then it MUST query the
client via a CB_GETATTR for the fattr4_time_deleg_access (see
Section 5.2) attribute and fattr4_time_deleg_modify attribute (see
Section 5.2).
Thus, we should look that the server supports proxying of times via
OPEN4_SHARE_ACCESS_WANT_DELEG_TIMESTAMPS.
We want to be extra pedantic and continue to check that FATTR4_TIME_DELEG_ACCESS
and FATTR4_TIME_DELEG_MODIFY are set. The server needs to expose both for the
client to correctly detect "Proxying of Times" support.
Signed-off-by: Roi Azarzar <[email protected]> Reviewed-by: Jeff Layton <[email protected]> Fixes: dcb3c20f7419 ("NFSv4: Add a capability for delegated attributes") Signed-off-by: Anna Schumaker <[email protected]>
If the server is down when the client is trying to mount, so that the
calls to exchange_id or create_session fail, then we should allow the
mount system call to fail rather than hang and block other mount/umount
calls.
Zhaoyang Huang [Fri, 30 Aug 2024 03:27:47 +0000 (11:27 +0800)]
fs: nfs: fix missing refcnt by replacing folio_set_private by folio_attach_private
This patch is inspired by a code review of fs codes which aims at
folio's extra refcnt that could introduce unwanted behavious when
judging refcnt, such as[1].That is, the folio passed to
mapping_evict_folio carries the refcnts from find_lock_entries,
page_cache, corresponding to PTEs and folio's private if has. However,
current code doesn't take the refcnt for folio's private which could
have mapping_evict_folio miss the one to only PTE and lead to
call filemap_release_folio wrongly.
[1]
long mapping_evict_folio(struct address_space *mapping, struct folio *folio)
{
...
//current code will misjudge here if there is one pte on the folio which
is be deemed as the one as folio's private
if (folio_ref_count(folio) >
folio_nr_pages(folio) + folio_has_private(folio) + 1)
return 0;
if (!filemap_release_folio(folio, 0))
return 0;
Gaosheng Cui [Mon, 26 Aug 2024 03:21:57 +0000 (11:21 +0800)]
nfs: Remove obsoleted declaration for nfs_read_prepare
The nfs_read_prepare() have been removed since
commit a4cdda59111f ("NFS: Create a common pgio_rpc_prepare function"),
and now it is useless, so remove it.
destroy_wait doesn't store all RPC clients. There was a list named
"all_clients" above it, which got moved to struct sunrpc_net in 2012,
but the comment was never removed.
Fixes: 70abc49b4f4a ("SUNRPC: make SUNPRC clients list per network namespace context") Signed-off-by: Siddh Raman Pant <[email protected]> Signed-off-by: Anna Schumaker <[email protected]>
Stephen Brennan [Mon, 19 Aug 2024 15:58:59 +0000 (08:58 -0700)]
SUNRPC: convert RPC_TASK_* constants to enum
The RPC_TASK_* constants are defined as macros, which means that most
kernel builds will not contain their definitions in the debuginfo.
However, it's quite useful for debuggers to be able to view the task
state constant and interpret it correctly. Conversion to an enum will
ensure the constants are present in debuginfo and can be interpreted by
debuggers without needing to hard-code them and track their changes.
Kunwu Chan [Fri, 16 Aug 2024 02:07:40 +0000 (10:07 +0800)]
SUNRPC: Fix -Wformat-truncation warning
Increase size of the servername array to avoid truncated output warning.
net/sunrpc/clnt.c:582:75: error:‘%s’ directive output may be truncated
writing up to 107 bytes into a region of size 48
[-Werror=format-truncation=]
582 | snprintf(servername, sizeof(servername), "%s",
| ^~
net/sunrpc/clnt.c:582:33: note:‘snprintf’ output
between 1 and 108 bytes into a destination of size 48
582 | snprintf(servername, sizeof(servername), "%s",
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
583 | sun->sun_path);
Thorsten Blum [Wed, 14 Aug 2024 10:01:28 +0000 (12:01 +0200)]
nfs: Annotate struct nfs_cache_array with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
array to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Increment size before adding a new struct to the array.
I have evidence of an Linux NFS client getting NFS4ERR_BAD_SEQID to a
v4.0 LOCK request to a Linux server (which had fixed the problem with
RELEASE_LOCKOWNER bug fixed).
The LOCK request presented a "new" lock owner so there are two seq ids
in the request: that for the open file, and that for the new lock.
Given the context I am confident that the new lock owner was reported to
have the wrong seqid. As lock owner identifiers are reused, the server
must still have a lock owner active which the client thinks is no longer
active.
I wasn't able to determine a root-cause but the simplest fix seems to be
to ensure lock owners are always unique much as open owners are (thanks
to a time stamp). The easiest way to ensure uniqueness is with a 64bit
counter for each server. That will never cycle (if updated once a
nanosecond the last 584 years. A single NFS server would not handle
open/lock requests nearly that fast, and a Linux node is unlikely to
have an uptime approaching that).
This patch removes the 2 ida and instead uses a per-server
atomic64_t to provide uniqueness.
Note that the lock owner already encodes the id as 64 bits even though
it is a 32bit value. So changing to a 64bit value does not change the
encoding of the lock owner. The open owner encoding is now 4 bytes
larger.
Li Lingfeng [Wed, 4 Sep 2024 12:34:57 +0000 (20:34 +0800)]
nfs: fix memory leak in error path of nfs4_do_reclaim
Commit c77e22834ae9 ("NFSv4: Fix a potential sleep while atomic in
nfs4_do_reclaim()") separate out the freeing of the state owners from
nfs4_purge_state_owners() and finish it outside the rcu lock.
However, the error path is omitted. As a result, the state owners in
"freeme" will not be released.
Fix it by adding freeing in the error path.
Fixes: c77e22834ae9 ("NFSv4: Fix a potential sleep while atomic in nfs4_do_reclaim()") Signed-off-by: Li Lingfeng <[email protected]> Cc: [email protected] # v5.3+ Signed-off-by: Anna Schumaker <[email protected]>
Anna Schumaker [Mon, 23 Sep 2024 19:00:07 +0000 (15:00 -0400)]
Merge tag 'nfsd-6.12' into linux-next-with-localio
NFSD 6.12 Release Notes
Notable features of this release include:
- Pre-requisites for automatically determining the RPC server thread
count
- Clean-up and preparation for supporting LOCALIO, which will be
merged via the NFS client tree
- Enhancements and fixes to NFSv4.2 COPY offload
- A new Python-based tool for generating kernel SunRPC XDR encoding
and decoding functions, added as an aid for prototyping features
in protocols based on the Linux kernel's SunRPC implementation.
As always I am grateful to the NFSD contributors, reviewers,
testers, and bug reporters who participated during this cycle.
Chuck Lever [Fri, 13 Sep 2024 17:50:56 +0000 (13:50 -0400)]
xdrgen: Prevent reordering of encoder and decoder functions
I noticed that "xdrgen source" reorders the procedure encoder and
decoder functions every time it is run. I would prefer that the
generated code be more deterministic: it enables a reader to better
see exactly what has changed between runs of the tool.
The problem is that Python sets are not ordered. I use a Python set
to ensure that, when multiple procedures use a particular argument or
result type, the encoder/decoder for that type is emitted only once.
Sets aren't ordered, but I can use Python dictionaries for this
purpose to ensure the procedure functions are always emitted in the
same order if the .x file does not change.
Chuck Lever [Fri, 13 Sep 2024 18:08:13 +0000 (14:08 -0400)]
tools: Add xdrgen
Add a Python-based tool for translating XDR specifications into XDR
encoder and decoder functions written in the Linux kernel's C coding
style. The generator attempts to match the usual C coding style of
the Linux kernel's SunRPC consumers.
This approach is similar to the netlink code generator in
tools/net/ynl .
The maintainability benefits of machine-generated XDR code include:
- Stronger type checking
- Reduces the number of bugs introduced by human error
- Makes the XDR code easier to audit and analyze
- Enables rapid prototyping of new RPC-based protocols
- Hardens the layering between protocol logic and marshaling
- Makes it easier to add observability on demand
- Unit tests might be built for both the tool and (automatically)
for the generated code
In addition, converting the XDR layer to use memory-safe languages
such as Rust will be easier if much of the code can be converted
automatically.
nfsd: fix delegation_blocked() to block correctly for at least 30 seconds
The pair of bloom filtered used by delegation_blocked() was intended to
block delegations on given filehandles for between 30 and 60 seconds. A
new filehandle would be recorded in the "new" bit set. That would then
be switch to the "old" bit set between 0 and 30 seconds later, and it
would remain as the "old" bit set for 30 seconds.
Unfortunately the code intended to clear the old bit set once it reached
30 seconds old, preparing it to be the next new bit set, instead cleared
the *new* bit set before switching it to be the old bit set. This means
that the "old" bit set is always empty and delegations are blocked
between 0 and 30 seconds.
This patch updates bd->new before clearing the set with that index,
instead of afterwards.
Jeff Layton [Mon, 9 Sep 2024 14:40:53 +0000 (10:40 -0400)]
nfsd: fix initial getattr on write delegation
At this point in compound processing, currentfh refers to the parent of
the file, not the file itself. Get the correct dentry from the delegation
stateid instead.
Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
NeilBrown [Thu, 29 Aug 2024 13:26:40 +0000 (09:26 -0400)]
nfsd: untangle code in nfsd4_deleg_getattr_conflict()
The code in nfsd4_deleg_getattr_conflict() is convoluted and buggy.
With this patch we:
- properly handle non-nfsd leases. We must not assume flc_owner is a
delegation unless fl_lmops == &nfsd_lease_mng_ops
- move the main code out of the for loop
- have a single exit which calls nfs4_put_stid()
(and other exits which don't need to call that)
[ jlayton: refactored on top of Neil's other patch: nfsd: fix
nfsd4_deleg_getattr_conflict in presence of third party lease ]
Fixes: c5967721e106 ("NFSD: handle GETATTR conflict with write delegation") Signed-off-by: NeilBrown <[email protected]> Signed-off-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
Scott Mayhew [Mon, 9 Sep 2024 20:28:54 +0000 (16:28 -0400)]
nfsd: enforce upper limit for namelen in __cld_pipe_inprogress_downcall()
This patch is intended to go on top of "nfsd: return -EINVAL when
namelen is 0" from Li Lingfeng. Li's patch checks for 0, but we should
be enforcing an upper bound as well.
Note that if nfsdcld somehow gets an id > NFS4_OPAQUE_LIMIT in its
database, it'll truncate it to NFS4_OPAQUE_LIMIT when it does the
downcall anyway.
Li Lingfeng [Tue, 3 Sep 2024 11:14:46 +0000 (19:14 +0800)]
nfsd: return -EINVAL when namelen is 0
When we have a corrupted main.sqlite in /var/lib/nfs/nfsdcld/, it may
result in namelen being 0, which will cause memdup_user() to return
ZERO_SIZE_PTR.
When we access the name.data that has been assigned the value of
ZERO_SIZE_PTR in nfs4_client_to_reclaim(), null pointer dereference is
triggered.
Chuck Lever [Wed, 28 Aug 2024 17:40:04 +0000 (13:40 -0400)]
NFSD: Limit the number of concurrent async COPY operations
Nothing appears to limit the number of concurrent async COPY
operations that clients can start. In addition, AFAICT each async
COPY can copy an unlimited number of 4MB chunks, so can run for a
long time. Thus IMO async COPY can become a DoS vector.
Add a restriction mechanism that bounds the number of concurrent
background COPY operations. Start simple and try to be fair -- this
patch implements a per-namespace limit.
An async COPY request that occurs while this limit is exceeded gets
NFS4ERR_DELAY. The requesting client can choose to send the request
again after a delay or fall back to a traditional read/write style
copy.
If there is need to make the mechanism more sophisticated, we can
visit that in future patches.
Chuck Lever [Wed, 28 Aug 2024 17:40:03 +0000 (13:40 -0400)]
NFSD: Async COPY result needs to return a write verifier
Currently, when NFSD handles an asynchronous COPY, it returns a
zero write verifier, relying on the subsequent CB_OFFLOAD callback
to pass the write verifier and a stable_how4 value to the client.
However, if the CB_OFFLOAD never arrives at the client (for example,
if a network partition occurs just as the server sends the
CB_OFFLOAD operation), the client will never receive this verifier.
Thus, if the client sends a follow-up COMMIT, there is no way for
the client to assess the COMMIT result.
The usual recovery for a missing CB_OFFLOAD is for the client to
send an OFFLOAD_STATUS operation, but that operation does not carry
a write verifier in its result. Neither does it carry a stable_how4
value, so the client /must/ send a COMMIT in this case -- which will
always fail because currently there's still no write verifier in the
COPY result.
Thus the server needs to return a normal write verifier in its COPY
result even if the COPY operation is to be performed asynchronously.
If the server recognizes the callback stateid in subsequent
OFFLOAD_STATUS operations, then obviously it has not restarted, and
the write verifier the client received in the COPY result is still
valid and can be used to assess a COMMIT of the copied data, if one
is needed.
NeilBrown [Fri, 30 Aug 2024 07:03:17 +0000 (17:03 +1000)]
nfsd: avoid races with wake_up_var()
wake_up_var() needs a barrier after the important change is made in the
var and before wake_up_var() is called, else it is possible that a wake
up won't be sent when it should.
In each case here the var is changed in an "atomic" manner, so
smb_mb__after_atomic() is sufficient.
In one case the important change (removing the lease) is performed
*after* the wake_up, which is backwards. The code survives in part
because the wait_var_event is given a timeout.
This patch adds the required barriers and calls destroy_delegation()
*before* waking any threads waiting for the delegation to be destroyed.
Thorsten Blum [Wed, 28 Aug 2024 21:42:55 +0000 (23:42 +0200)]
NFSD: Annotate struct pnfs_block_deviceaddr with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
volumes to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Use struct_size() instead of manually calculating the number of bytes to
allocate for a pnfs_block_deviceaddr with a single volume.
Guoqing Jiang [Wed, 21 Aug 2024 14:03:18 +0000 (22:03 +0800)]
nfsd: call cache_put if xdr_reserve_space returns NULL
If not enough buffer space available, but idmap_lookup has triggered
lookup_fn which calls cache_get and returns successfully. Then we
missed to call cache_put here which pairs with cache_get.
Fixes: ddd1ea563672 ("nfsd4: use xdr_reserve_space in attribute encoding") Signed-off-by: Guoqing Jiang <[email protected]> Reviwed-by: Jeff Layton <[email protected]> Signed-off-by: Chuck Lever <[email protected]>
Jeff Layton [Mon, 26 Aug 2024 12:50:12 +0000 (08:50 -0400)]
nfsd: track the main opcode for callbacks
Keep track of the "main" opcode for the callback, and display it in the
tracepoint. This makes it simpler to discern what's happening when there
is more than one callback in flight.
The one special case is the CB_NULL RPC. That's not a CB_COMPOUND
opcode, so designate the value 0 for that.
Li Lingfeng [Fri, 23 Aug 2024 07:00:49 +0000 (15:00 +0800)]
nfsd: remove unused parameter of nfsd_file_mark_find_or_create
Commit 427f5f83a319 ("NFSD: Ensure nf_inode is never dereferenced") passes
inode directly to nfsd_file_mark_find_or_create instead of getting it from
nf, so there is no need to pass nf.
Li Lingfeng [Wed, 14 Aug 2024 11:29:07 +0000 (19:29 +0800)]
NFSD: remove redundant assignment operation
Commit 5826e09bf3dd ("NFSD: OP_CB_RECALL_ANY should recall both read and
write delegations") added a new assignment statement to add
RCA4_TYPE_MASK_WDATA_DLG to ra_bmval bitmask of OP_CB_RECALL_ANY. So the
old one should be removed.
Chuck Lever [Sun, 11 Aug 2024 17:11:07 +0000 (13:11 -0400)]
NFSD: Fix NFSv4's PUTPUBFH operation
According to RFC 8881, all minor versions of NFSv4 support PUTPUBFH.
Replace the XDR decoder for PUTPUBFH with a "noop" since we no
longer want the minorversion check, and PUTPUBFH has no arguments to
decode. (Ideally nfsd4_decode_noop should really be called
nfsd4_decode_void).
Mark Grimes [Wed, 7 Aug 2024 01:58:34 +0000 (18:58 -0700)]
nfsd: Add quotes to client info 'callback address'
The 'callback address' in client_info_show is output without quotes
causing yaml parsers to fail on processing IPv6 addresses.
Adding quotes to 'callback address' also matches that used by
the 'address' field.
Chuck Lever [Mon, 29 Jul 2024 20:52:32 +0000 (16:52 -0400)]
svcrdma: Handle device removal outside of the CM event handler
Synchronously wait for all disconnects to complete to ensure the
transports have divested all hardware resources before the
underlying RDMA device can safely be removed.
NeilBrown [Wed, 14 Aug 2024 13:21:01 +0000 (09:21 -0400)]
nfsd: move error choice for incorrect object types to version-specific code.
If an NFS operation expects a particular sort of object (file, dir, link,
etc) but gets a file handle for a different sort of object, it must
return an error. The actual error varies among NFS versions in non-trivial
ways.
For v2 and v3 there are ISDIR and NOTDIR errors and, for NFSv4 only,
INVAL is suitable.
For v4.0 there is also NFS4ERR_SYMLINK which should be used if a SYMLINK
was found when not expected. This take precedence over NOTDIR.
For v4.1+ there is also NFS4ERR_WRONG_TYPE which should be used in
preference to EINVAL when none of the specific error codes apply.
When nfsd_mode_check() finds a symlink where it expected a directory it
needs to return an error code that can be converted to NOTDIR for v2 or
v3 but will be SYMLINK for v4. It must be different from the error
code returns when it finds a symlink but expects a regular file - that
must be converted to EINVAL or SYMLINK.
So we introduce an internal error code nfserr_symlink_not_dir which each
version converts as appropriate.
nfsd_check_obj_isreg() is similar to nfsd_mode_check() except that it is
only used by NFSv4 and only for OPEN. NFSERR_INVAL is never a suitable
error if the object is the wrong time. For v4.0 we use nfserr_symlink
for non-dirs even if not a symlink. For v4.1 we have nfserr_wrong_type.
We handle this difference in-place in nfsd_check_obj_isreg() as there is
nothing to be gained by delaying the choice to nfsd4_map_status().
As a result of these changes, nfsd_mode_check() doesn't need an rqstp
arg any more.
Note that NFSv4 operations are actually performed in the xdr code(!!!)
so to the only place that we can map the status code successfully is in
nfsd4_encode_operation().
nfsd: be more systematic about selecting error codes for internal use.
Rather than using ad hoc values for internal errors (30000, 11000, ...)
use 'enum' to sequentially allocate numbers starting from the first
known available number - now visible as NFS4ERR_FIRST_FREE.
The goal is values that are distinct from all be32 error codes. To get
those we must first select integers that are not already used, then
convert them with cpu_to_be32().
nfsd: Move error code mapping to per-version proc code.
There is code scattered around nfsd which chooses an error status based
on the particular version of nfs being used. It is cleaner to have the
version specific choices in version specific code.
With this patch common code returns the most specific error code
possible and the version specific code maps that if necessary.
Both v2 (nfsproc.c) and v3 (nfs3proc.c) now have a "map_status()"
function which is called to map the resp->status before each non-trivial
nfsd_proc_* or nfsd3_proc_* function returns.
NFS4ERR_SYMLINK and NFS4ERR_WRONG_TYPE introduce extra complications and
are left for a later patch.
With this patch the only places that test ->rq_vers against a specific
version are nfsd_v4client() and nfsd_set_fh_dentry().
The latter sets some flags in the svc_fh, which now includes:
fh_64bit_cookies
fh_use_wgather
nfsd: use nfsd_v4client() in nfsd_breaker_owns_lease()
nfsd_breaker_owns_lease() currently open-codes the same test that
nfsd_v4client() performs.
With this patch we use nfsd_v4client() instead.
Also as i_am_nfsd() is only used in combination with kthread_data(),
replace it with nfsd_current_rqst() which combines the two and returns a
valid svc_rqst, or NULL.
The test for NULL is moved into nfsd_v4client() for code clarity.
nfsd: Pass 'cred' instead of 'rqstp' to some functions.
nfsd_permission(), exp_rdonly(), nfsd_setuser(), and nfsexp_flags()
only ever need the cred out of rqstp, so pass it explicitly instead of
the whole rqstp.
sunrpc: allow svc threads to fail initialisation cleanly
If an svc thread needs to perform some initialisation that might fail,
it has no good way to handle the failure.
Before the thread can exit it must call svc_exit_thread(), but that
requires the service mutex to be held. The thread cannot simply take
the mutex as that could deadlock if there is a concurrent attempt to
shut down all threads (which is unlikely, but not impossible).
nfsd currently call svc_exit_thread() unprotected in the unlikely event
that unshare_fs_struct() fails.
We can clean this up by introducing svc_thread_init_status() by which an
svc thread can report whether initialisation has succeeded. If it has,
it continues normally into the action loop. If it has not,
svc_thread_init_status() immediately aborts the thread.
svc_start_kthread() waits for either of these to happen, and calls
svc_exit_thread() (under the mutex) if the thread aborted.
sunrpc: don't take ->sv_lock when updating ->sv_nrthreads.
As documented in svc_xprt.c, sv_nrthreads is protected by the service
mutex, and it does not need ->sv_lock.
(->sv_lock is needed only for sv_permsocks, sv_tempsocks, and
sv_tmpcnt).
Merge tag 'timers_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Borislav Petkov:
- Remove percpu irq related code in the timer-of initialization routine
as it is broken but also unused (Daniel Lezcano)
- Fix return -ETIME when delta exceeds INT_MAX and the next event not
taking effect sometimes (Jacky Bai)
* tag 'timers_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/imx-tpm: Fix next event not taking effect sometime
clocksource/drivers/imx-tpm: Fix return -ETIME when delta exceeds INT_MAX
clocksource/drivers/timer-of: Remove percpu irq related code
Merge tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Fix perf's AUX buffer serialization
- Prevent uninitialized struct members in perf's uprobes handling
* tag 'perf_urgent_for_v6.11_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/aux: Fix AUX buffer serialization
uprobes: Use kzalloc to allocate xol area
Merge tag 'char-misc-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small char/misc/other driver fixes for 6.11-rc7. It's
nothing huge, just a bunch of small fixes of reported problems,
including:
- lots of tiny iio driver fixes
- nvmem driver fixex
- binder UAF bugfix
- uio driver crash fix
- other small fixes
All of these have been in linux-next this week with no reported
problems"
* tag 'char-misc-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (21 commits)
VMCI: Fix use-after-free when removing resource in vmci_resource_remove()
Drivers: hv: vmbus: Fix rescind handling in uio_hv_generic
uio_hv_generic: Fix kernel NULL pointer dereference in hv_uio_rescind
misc: keba: Fix sysfs group creation
dt-bindings: nvmem: Use soc-nvmem node name instead of nvmem
nvmem: Fix return type of devm_nvmem_device_get() in kerneldoc
nvmem: u-boot-env: error if NVMEM device is too small
misc: fastrpc: Fix double free of 'buf' in error path
binder: fix UAF caused by offsets overwrite
iio: imu: inv_mpu6050: fix interrupt status read for old buggy chips
iio: adc: ad7173: fix GPIO device info
iio: adc: ad7124: fix DT configuration parsing
iio: adc: ad_sigma_delta: fix irq_flags on irq request
iio: adc: ads1119: Fix IRQ flags
iio: fix scale application in iio_convert_raw_to_processed_unlocked
iio: adc: ad7124: fix config comparison
iio: adc: ad7124: fix chip ID mismatch
iio: adc: ad7173: Fix incorrect compatible string
iio: buffer-dmaengine: fix releasing dma channel on error
iio: adc: ad7606: remove frstdata check for serial mode
...
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"A pile of Qualcomm clk driver fixes with two main themes: the alpha
PLL driver and shared RCGs, and one fix for the Starfive JH7110 SoC.
- The Alpha PLL clk_ops had multiple problems around setting rates.
There are a handful of patches here that fix masks and skip
enabling the clk from set_rate() when the PLL is disabled. The PLLs
are crucial to operation of the system as almost all frequencies in
the system are derived from them.
- Parking shared RCGs at a slow always on clk at registration time
breaks stuff.
USB host mode can't handle such a slow frequency and the serial
console gets all garbled when the UART clk is handed over to the
kernel. There's a few patches that don't use the shared clk_ops for
the UART clks and another one to skip parking the USB clk at
registration time.
- The Starfive PLL driver used for the CPU was busted causing cpufreq
to fail because the clk didn't change to a safe parent during
set_rate().
The fix is to register a notifier and switch to a safe parent so
the PLL can change rate in a glitch free manner"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: qcom: gcc-sc8280xp: don't use parking clk_ops for QUPs
clk: starfive: jh7110-sys: Add notifier for PLL0 clock
clk: qcom: gcc-sm8650: Don't use shared clk_ops for QUPs
clk: qcom: gcc-sm8550: Don't park the USB RCG at registration time
clk: qcom: gcc-sm8550: Don't use parking clk_ops for QUPs
clk: qcom: gcc-x1e80100: Don't use parking clk_ops for QUPs
clk: qcom: ipq9574: Update the alpha PLL type for GPLLs
clk: qcom: gcc-x1e80100: Fix USB 0 and 1 PHY GDSC pwrsts flags
clk: qcom: clk-alpha-pll: Update set_rate for Zonda PLL
clk: qcom: clk-alpha-pll: Fix zonda set_rate failure when PLL is disabled
clk: qcom: clk-alpha-pll: Fix the trion pll postdiv set rate API
clk: qcom: clk-alpha-pll: Fix the pll post div mask
Merge tag 'pinctrl-v6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fix from Linus Walleij:
"A single fix for Qualcomm laptops that are affected by
missing wakeup IRQs"
* tag 'pinctrl-v6.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: qcom: x1e80100: Bypass PDC wakeup parent for now
Merge tag 'linux_kselftest-kunit-fixes-6.11-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
PullKUnit fix from Shuah Khan:
"Fix to a missing function parameter warning found during documentation
build in linux-next"
* tag 'linux_kselftest-kunit-fixes-6.11-rc7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: Fix missing kerneldoc comment
Merge tag 'pci-v6.11-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Unregister platform devices for child nodes when stopping a PCI
device, even if the PCI core has already cleared the OF_POPULATED bit
and of_platform_depopulate() doesn't do anything (Bartosz
Golaszewski)
- Rescan the bus from a separate thread so we don't deadlock when
triggering rescan from sysfs (Bartosz Golaszewski)
* tag 'pci-v6.11-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI/pwrctl: Rescan bus on a separate thread
PCI: Don't rely on of_platform_depopulate() for reused OF-nodes
Merge tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix potential mount hang
- fix retry problem in two types of compound operations
- important netfs integration fix in SMB1 read paths
- fix potential uninitialized zero point of inode
- minor patch to improve debugging for potential crediting problems
* tag 'v6.11-rc6-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
netfs, cifs: Improve some debugging bits
cifs: Fix SMB1 readv/writev callback in the same way as SMB2/3
cifs: Fix zero_point init on inode initialisation
smb: client: fix double put of @cfile in smb2_set_path_size()
smb: client: fix double put of @cfile in smb2_rename_path()
smb: client: fix hang in wait_for_response() for negproto
KVM: x86: don't fall through case statements without annotations
clang warns on this because it has an unannotated fall-through between
cases:
arch/x86/kvm/x86.c:4819:2: error: unannotated fall-through between switch labels [-Werror,-Wimplicit-fallthrough]
and while we could annotate it as a fallthrough, the proper fix is to
just add the break for this case, instead of falling through to the
default case and the break there.
gcc also has that warning, but it looks like gcc only warns for the
cases where they fall through to "real code", rather than to just a
break. Odd.
Fixes: d30d9ee94cc0 ("KVM: x86: Only advertise KVM_CAP_READONLY_MEM when supported by VM") Cc: Paolo Bonzini <[email protected]> Cc: Tom Dohrmann <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Fix the arm64 usage of ftrace_graph_ret_addr() to pass the
&state->graph_idx pointer instead of NULL, otherwise this function
just returns early"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: stacktrace: fix the usage of ftrace_graph_ret_addr()
Merge tag 'riscv-for-linus-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A revert for the mmap() change that ties the allocation range to the
hint adress, as what we tried to do ended up regressing on other
userspace workloads.
- A fix to avoid a kernel memory leak when emulating misaligned
accesses from userspace.
- A Kconfig fix for toolchain vector detection, which now correctly
detects vector support on toolchains where the V extension depends on
the M extension.
- A fix to avoid failing the linear mapping bootmem bounds check on
NOMMU systems.
- A fix for early alternatives on relocatable kernels.
* tag 'riscv-for-linus-6.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix RISCV_ALTERNATIVE_EARLY
riscv: Do not restrict memory size because of linear mapping on nommu
riscv: Fix toolchain vector detection
riscv: misaligned: Restrict user access to kernel memory
riscv: mm: Do not restrict mmap address based on hint
riscv: selftests: Remove mmap hint address checks
Revert "RISC-V: mm: Document mmap changes"