===========
NFS LOCALIO
===========

Overview
========

The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
server to reliably handshake to determine if they are on the same
host. Select "NFS client and server support for LOCALIO auxiliary
protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).

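For example, a kernel configuration that enables LOCALIO on a host that
acts as both NFS client and server would include something like the
following (whether the client and server are built in or modular is the
builder's choice)::

    CONFIG_NFS_FS=m
    CONFIG_NFSD=m
    CONFIG_NFS_LOCALIO=y
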
Once an NFS client and server handshake as "local", the client will
bypass the network RPC protocol for read, write and commit operations.
Due to this XDR and RPC bypass, these operations perform faster.

The LOCALIO auxiliary protocol's implementation, which uses the same
connection as NFS traffic, follows the pattern established by the NFS
ACL protocol extension.

The LOCALIO auxiliary protocol is needed to allow robust discovery of
clients local to their servers. In a private implementation that
preceded use of this LOCALIO protocol, a fragile sockaddr network
address based match against all local network interfaces was attempted.
But unlike the LOCALIO protocol, the sockaddr-based matching didn't
handle use of iptables or containers.

The robust handshake between local client and server is just the
beginning; the ultimate use case that this locality makes possible is
that the client can open files and issue reads, writes and commits
directly to the server without having to go over the network. The
requirement is to perform these loopback NFS operations as efficiently
as possible; this is particularly useful for container use cases
(e.g. kubernetes) where it is possible to run an IO job local to the
server.

The performance advantage realized from LOCALIO's ability to bypass
using XDR and RPC for reads, writes and commits can be extreme, e.g.:

fio for 20 secs with directio, qd of 8, 16 libaio threads:

- With LOCALIO::

    4K read:    IOPS=979k,  BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
    4K write:   IOPS=165k,  BW=646MiB/s  (678MB/s)(12.6GiB/20002msec)
    128K read:  IOPS=402k,  BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
    128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)

- Without LOCALIO::

    4K read:    IOPS=79.2k, BW=309MiB/s  (324MB/s)(6188MiB/20003msec)
    4K write:   IOPS=59.8k, BW=234MiB/s  (245MB/s)(4671MiB/20002msec)
    128K read:  IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
    128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)

fio for 20 secs with directio, qd of 8, 1 libaio thread:

- With LOCALIO::

    4K read:    IOPS=230k,  BW=898MiB/s  (941MB/s)(17.5GiB/20001msec)
    4K write:   IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
    128K read:  IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
    128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)

- Without LOCALIO::

    4K read:    IOPS=77.1k, BW=301MiB/s  (316MB/s)(6022MiB/20001msec)
    4K write:   IOPS=32.8k, BW=128MiB/s  (135MB/s)(2566MiB/20001msec)
    128K read:  IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
    128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)

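The fio job file behind these numbers is not reproduced here, but an
invocation along the following lines (the file path and read/write mix
are placeholders; the queue depth, runtime, thread count and block size
match the parameters stated above) exercises a comparable workload::

    fio --name=localio-4k-read --filename=/mnt/nfs/fio.dat --size=8G \
        --ioengine=libaio --direct=1 --iodepth=8 --numjobs=16 \
        --runtime=20 --time_based --bs=4k --rw=read --group_reporting
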
FAQ
===

1. What are the use cases for LOCALIO?

   a. Workloads where the NFS client and server are on the same host
      realize improved IO performance. In particular, it is common when
      running containerised workloads for jobs to find themselves
      running on the same host as the knfsd server being used for
      storage.

2. What are the requirements for LOCALIO?

   a. Bypass use of the network RPC protocol as much as possible. This
      includes bypassing XDR and RPC for open, read, write and commit
      operations.
   b. Allow client and server to autonomously discover if they are
      running local to each other without making any assumptions about
      the local network topology.
   c. Support the use of containers by being compatible with relevant
      namespaces (e.g. network, user, mount).
   d. Support all versions of NFS. NFSv3 is of particular importance
      because it has wide enterprise usage and pNFS flexfiles makes use
      of it for the data path.

3. Why doesn't LOCALIO just compare IP addresses or hostnames when
   deciding if the NFS client and server are co-located on the same
   host?

   Since one of the main use cases is containerised workloads, we cannot
   assume that IP addresses will be shared between the client and
   server. This sets up a requirement for a handshake protocol that
   needs to go over the same connection as the NFS traffic in order to
   identify that the client and the server really are running on the
   same host. The handshake uses a secret that is sent over the wire,
   and can be verified by both parties by comparing with a value stored
   in shared kernel memory if they are truly co-located.

4. Does LOCALIO improve pNFS flexfiles?

   Yes, LOCALIO complements pNFS flexfiles by allowing it to take
   advantage of NFS client and server locality. Policy that initiates
   client IO as close as possible to the server where the data is
   stored naturally benefits from the data path optimization LOCALIO
   provides.

5. Why not develop a new pNFS layout to enable LOCALIO?

   A new pNFS layout could be developed, but doing so would put the
   onus on the server to somehow discover that the client is co-located
   when deciding to hand out the layout.
   There is value in a simpler approach (as provided by LOCALIO) that
   allows the NFS client to negotiate and leverage locality without
   requiring more elaborate modeling and discovery of such locality in
   a more centralized manner.

6. Why is having the client perform a server-side file OPEN, without
   using RPC, beneficial? Is the benefit pNFS specific?

   Avoiding the use of XDR and RPC for file opens is beneficial to
   performance regardless of whether pNFS is used. Especially when
   dealing with small files, it's best to avoid going over the wire
   whenever possible; otherwise it could reduce or even negate the
   benefits of avoiding the wire for doing the small file I/O itself.
   Given LOCALIO's requirements, the current approach of having the
   client perform a server-side file open, without using RPC, is ideal.
   If in the future requirements change then we can adapt accordingly.

7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?

   Strong authentication is usually tied to the connection itself. It
   works by establishing a context that is cached by the server, and
   that acts as the key for discovering the authorisation token, which
   can then be passed to rpc.mountd to complete the authentication
   process. On the other hand, in the case of AUTH_UNIX, the credential
   that was passed over the wire is used directly as the key in the
   upcall to rpc.mountd. This simplifies the authentication process, and
   so makes AUTH_UNIX easier to support.

8. How do export options that translate RPC user IDs behave for LOCALIO
   operations (e.g. root_squash, all_squash)?

   Export options that translate user IDs are managed by nfsd_setuser(),
   which is called by nfsd_setuser_and_check_port(), which is called by
   __fh_verify(). So they get handled exactly the same way for LOCALIO
   as they do for non-LOCALIO.

9. How does LOCALIO make certain that object lifetimes are managed
   properly given NFSD and NFS operate in different contexts?

   See the detailed "NFS Client and Server Interlock" section below.

RPC
===

The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
RPC method that allows the Linux NFS client to verify the local Linux
NFS server can see the nonce (single-use UUID) the client generated and
made available in nfs_common. This protocol isn't part of an IETF
standard, nor does it need to be, considering it is a Linux-to-Linux
auxiliary RPC protocol that amounts to an implementation detail.

The UUID_IS_LOCAL method encodes the client-generated uuid_t in terms of
the fixed UUID_SIZE (16 bytes). The fixed-size opaque encode and decode
XDR methods are used instead of the less efficient variable-sized
methods.

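As an illustration of that encoding choice, a client-side argument
encoder can reserve exactly UUID_SIZE bytes and copy in the raw UUID
with no length prefix. This is a simplified sketch (the function name is
illustrative, not the verbatim in-kernel code)::

    #include <linux/sunrpc/clnt.h>
    #include <linux/sunrpc/xdr.h>

    /* Encode the UUID_IS_LOCAL argument as a fixed-size opaque. */
    static void localio_xdr_enc_uuidargs(struct rpc_rqst *req,
                                         struct xdr_stream *xdr,
                                         const void *data)
    {
            const u8 *uuid = data;
            __be32 *p;

            /* Reserve exactly UUID_SIZE (16) bytes in the send buffer... */
            p = xdr_reserve_space(xdr, UUID_SIZE);
            if (p)
                    /* ...and copy the raw uuid_t, no length word needed. */
                    xdr_encode_opaque_fixed(p, uuid, UUID_SIZE);
    }
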
The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ )::

    Linux Kernel Organization       400122  nfslocalio

The LOCALIO protocol spec in rpcgen syntax is::

  /* raw RFC 9562 UUID */
  #define UUID_SIZE 16
  typedef u8 uuid_t<UUID_SIZE>;

  program NFS_LOCALIO_PROGRAM {
      version LOCALIO_V1 {
          void
              NULL(void) = 0;
          void
              UUID_IS_LOCAL(uuid_t) = 1;
      } = 1;
  } = 400122;

LOCALIO uses the same transport connection as NFS traffic. As such,
LOCALIO is not registered with rpcbind.

NFS Common and Client/Server Handshake
======================================

fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
to generate a nonce (single-use UUID) and associated short-lived
nfs_uuid_t struct, and to register it with nfs_common for subsequent
lookup and verification by the NFS server; if matched, the NFS server
populates members in the nfs_uuid_t struct. The NFS client then uses
nfs_common to transfer the nfs_uuid_t from its nfs_uuids to the
nn->nfsd_serv clients_list from the nfs_common's uuids_list. See:
fs/nfs/localio.c:nfs_local_probe()

nfs_common's nfs_uuids list is the basis for LOCALIO enablement, as such
it has members that point to nfsd memory for direct use by the client
(e.g. 'net' is the server's network namespace; through it the client can
access nn->nfsd_serv with proper rcu read access). It is this client
and server synchronization that enables advanced usage and lifetime of
objects to span from the host kernel's nfsd to per-container knfsd
instances that are connected to nfs clients running on the same local
host.

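Putting those pieces together, the handshake object looks roughly like
the following. This is a simplified sketch limited to the members
discussed in this document; the kernel's actual nfs_uuid_t definition
carries additional members and locking::

    #include <linux/list.h>
    #include <linux/uuid.h>

    struct net;

    typedef struct {
            uuid_t uuid;            /* single-use nonce generated by the NFS client */
            struct list_head list;  /* nfs_common's uuids_list, then nfsd's per-net list */
            struct net __rcu *net;  /* NFS server's network namespace, set once verified local */
    } nfs_uuid_t;
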
NFS Client and Server Interlock
===============================

LOCALIO provides the nfs_uuid_t object and associated interfaces to
allow proper network namespace (net-ns) and NFSD object refcounting:

We don't want to keep a long-term counted reference on each NFSD's
net-ns in the client because that prevents a server container from
completely shutting down.

So we avoid taking a reference at all and rely on the per-cpu
reference to the server (detailed below) being sufficient to keep
the net-ns active. This involves allowing the NFSD's net-ns exit
code to iterate all active clients and clear their ->net pointers
(which are needed to find the per-cpu-refcount for the nfsd_serv).

Details:

- Embed nfs_uuid_t in nfs_client. nfs_uuid_t provides a list_head
  that can be used to find the client. It does add the 16-byte
  uuid_t to nfs_client so it is bigger than needed (given that
  uuid_t is only used during the initial NFS client and server
  LOCALIO handshake to determine if they are local to each other).
  If that is really a problem we can find a fix.

- When the nfs server confirms that the uuid_t is local, it moves
  the nfs_uuid_t onto a per-net-ns list in NFSD's nfsd_net.

- When each server's net-ns is shutting down, in a "pre_exit"
  handler, all these nfs_uuid_t have their ->net cleared. There is
  a synchronize_rcu() call between pre_exit() handlers and exit()
  handlers, so any caller that sees nfs_uuid_t ->net as not NULL can
  safely manage the per-cpu-refcount for nfsd_serv (the sketch after
  this list shows the general pattern).

- The client's nfs_uuid_t is passed to nfsd_open_local_fh() so it
  can safely dereference ->net in a private rcu_read_lock() section
  to allow safe access to the associated nfsd_net and nfsd_serv.

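The ->net invalidation described above follows the kernel's standard
pernet_operations pattern, in which the networking core guarantees an
RCU grace period between all pre_exit() and exit() handlers. The
following is an illustrative sketch with hypothetical names; it is not
the actual fs/nfs_common code::

    #include <linux/list.h>
    #include <linux/rcupdate.h>
    #include <linux/spinlock.h>
    #include <net/net_namespace.h>

    /* Hypothetical stand-in for the handshake objects discussed above. */
    struct example_uuid {
            struct list_head list;
            struct net __rcu *net;  /* cleared when the server's net-ns goes away */
    };

    static LIST_HEAD(example_uuids);
    static DEFINE_SPINLOCK(example_uuids_lock);

    /* pre_exit() runs before exit(), with an RCU grace period in between,
     * so once the grace period elapses no reader still holds a stale ->net. */
    static void example_localio_pre_exit(struct net *net)
    {
            struct example_uuid *u;

            spin_lock(&example_uuids_lock);
            list_for_each_entry(u, &example_uuids, list) {
                    if (rcu_access_pointer(u->net) == net)
                            rcu_assign_pointer(u->net, NULL);
            }
            spin_unlock(&example_uuids_lock);
    }

    static struct pernet_operations example_localio_net_ops = {
            .pre_exit = example_localio_pre_exit,
    };

A reader that still observes a non-NULL ->net under rcu_read_lock() is
therefore guaranteed that the corresponding nfsd_net has not yet been
torn down, which is what makes it safe to try to take the nfsd_serv
per-cpu reference before dropping the RCU read lock.
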
So LOCALIO required the introduction and use of NFSD's percpu_ref to
interlock nfsd_destroy_serv() and nfsd_open_local_fh(), to ensure each
nn->nfsd_serv is not destroyed while in use by nfsd_open_local_fh(), and
warrants a more detailed explanation:

nfsd_open_local_fh() uses nfsd_serv_try_get() before opening its
nfsd_file handle and then the caller (NFS client) must drop the
reference for the nfsd_file and associated nn->nfsd_serv using
nfs_file_put_local() once it has completed its IO.

This interlock relies heavily on nfsd_open_local_fh() being able to
safely deal with the possibility that the NFSD's net-ns (and nfsd_net
by association) may have been destroyed by nfsd_destroy_serv() via
nfsd_shutdown_net() -- which is only possible given the nfs_uuid_t
->net pointer management detailed above.

All told, this elaborate interlock of the NFS client and server has been
verified to fix an easy-to-hit crash that would occur if an NFSD
instance running in a container, with a LOCALIO client mounted, is
shut down. Upon restart of the container and associated NFSD, the
client would go on to crash due to a NULL pointer dereference caused by
the LOCALIO client attempting nfsd_open_local_fh(), using
nn->nfsd_serv, without having a proper reference on nn->nfsd_serv.

NFS Client issues IO instead of Server
======================================

Because LOCALIO is focused on protocol bypass to achieve improved IO
performance, alternatives to the traditional NFS wire protocol (SUNRPC
with XDR) must be provided to access the backing filesystem.

See fs/nfs/localio.c:nfs_local_open_fh() and
fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
focused use of select nfs server objects to allow a client local to a
server to open a file pointer without needing to go over the network.

The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
both the associated nfsd network namespace and nn->nfsd_serv in terms of
RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
to nfs_local_open_fh() and the client will try to reestablish the
LOCALIO resources needed by calling nfs_local_probe() again. This
recovery is needed if/when an nfsd instance running in a container were
to reboot while a LOCALIO client is connected to it.

Once the client has an open nfsd_file pointer it will issue reads,
writes and commits directly to the underlying local filesystem (normally
done by the nfs server). As such, for these operations, the NFS client
is issuing IO to the underlying local filesystem that it is sharing with
the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
fs/nfs/localio.c:nfs_local_commit().

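Conceptually, once the client holds the struct file that the server
opened on its behalf, a LOCALIO read is simply a VFS read issued from
the client's context. A minimal sketch (an illustration, not the actual
nfs_local_doio() code, and assuming the caller already assembled a
bio_vec array) might look like::

    #include <linux/bvec.h>
    #include <linux/fs.h>
    #include <linux/uio.h>

    /* Issue a local read against the server-opened file, bypassing
     * SUNRPC/XDR entirely. */
    static ssize_t example_local_read(struct file *filp, struct bio_vec *bvec,
                                      unsigned int nr_segs, size_t count,
                                      loff_t *pos)
    {
            struct iov_iter iter;

            iov_iter_bvec(&iter, ITER_DEST, bvec, nr_segs, count);
            return vfs_iter_read(filp, &iter, pos, 0);
    }
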
Security
========

LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
AUTH_SYS) is used.

Care is taken to ensure the same NFS security mechanisms are used
(authentication, etc) regardless of whether LOCALIO or regular NFS
access is used. The auth_domain established as part of the traditional
NFS client access to the NFS server is also used for LOCALIO.

Relative to containers, LOCALIO gives the client access to the network
namespace the server has. This is required to allow the client to access
the server's per-namespace nfsd_net struct. With traditional NFS, the
client is afforded this same level of access (albeit in terms of the NFS
protocol via SUNRPC). No other namespaces (user, mount, etc) have been
altered or purposely extended from the server to the client.

Testing
=======

The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
and commit access have proven stable against various test scenarios:

- Client and server both on the same host.

- All permutations of client and server support enablement for both
  local and remote client and server.

- Testing against NFS storage products that don't support the LOCALIO
  protocol was also performed.

- Client on host, server within a container (for both v3 and v4.2).
  The container testing was in terms of podman managed containers and
  includes a successful container stop/restart scenario.

- Formalizing these test scenarios in terms of existing test
  infrastructure is ongoing. Initial regular coverage is provided in
  terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
  mount configuration, and includes lockdep and KASAN coverage, see:
  https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
  https://github.com/koverstreet/ktest

- Various kdevops testing (in terms of "Chuck's BuildBot") has been
  performed to regularly verify the LOCALIO changes haven't caused any
  regressions to non-LOCALIO NFS use cases.

- All of Hammerspace's various sanity tests pass with LOCALIO enabled
  (this includes numerous pNFS and flexfiles tests).