💾 Archived View for dioskouroi.xyz › thread › 29392402 captured on 2021-11-30 at 20:18:30. Gemini links have been rewritten to link to archived content


-=-=-=-=-=-=-

Reading NFS at 25GB/s using FIO and libnfs

Author: tarasglek

Score: 104

Comments: 20

Date: 2021-11-30 14:23:58

Web Link

________________________________________________________________________________

fefe23 wrote at 2021-11-30 18:45:20:

My takeaway from this is: you think your company is great.

No other information was transferred to me in this article.

I learned nothing about FIO, libnfs, NFS, or your patch.

No feature comparison with other efforts.

No benchmark comparison with other efforts.

I wasted my time here.

guenthert wrote at 2021-11-30 18:51:06:

Hmmh, there might be room for improvement, but I appreciate the article and the link here to it. It's been a while since I last tried to get the most out of an NFS server, and I was unaware that it's still used for high-performance applications. 25GB/s surely impressed me.

stingraycharles wrote at 2021-11-30 19:56:13:

As someone who recently spent two months trying to squeeze the highest perf out of a 160GBit NFS cluster, I’d say the article lacks a _lot_ of detail.

But it’s good to know there are ways to get much more perf out of it.

packetslave wrote at 2021-11-30 23:28:25:

Six sentences and you managed to mention yourself 5 times. Do better.

0xdky wrote at 2021-11-30 15:39:21:

I did something similar (~2015) but using the kernel NFS client and having multiple mounts to the same volume using different IP addresses.

Using vectored IO and spreading requests across multiple connections greatly improved throughput. However, metadata operations cannot be parallelized easily without application-side changes.

In more modern kernels, NFS supports ‘nconnect’ mount option to open multiple network connections for a single mount. I wonder if the approach of using libnfs for multiple connections is even required.

https://github.com/0xdky/iotrap
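For reference, this is roughly what the nconnect approach looks like (available since Linux 5.3, capped at 16 connections; the server name and paths here are placeholders):

```shell
# Open 8 TCP connections for a single NFS mount (kernel 5.3+).
# server:/export and /mnt/data are placeholder names.
sudo mount -t nfs -o vers=4.1,nconnect=8 server:/export /mnt/data

# Equivalent /etc/fstab entry:
# server:/export  /mnt/data  nfs  vers=4.1,nconnect=8  0  0
```

The kernel then round-robins RPCs across the connections, so no extra server IPs or application changes are needed.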

gjs278 wrote at 2021-11-30 21:50:48:

oh nice. I just gave nconnect a try and things do appear to be faster for a remote mount I use.

JoshTriplett wrote at 2021-11-30 20:04:21:

I've tried to drive NFS to reasonable levels of performance in the past, and the bottleneck has never seemed like storage or network or the NFS server; it always seems like the combination of the built-in NFS client in the Linux kernel, the implementation of filesystem semantics, and the behavior of common workloads ends up making NFS _much_ slower than network throughput. I'm impressed with these benchmarks and the approach to collecting them, but I'm wondering if the Linux kernel NFS client can get anywhere close to the theoretical limits.

I've tried this with a _read-only_ NFS server, across an AWS multi-gigabit connection, and I still found that I couldn't get anywhere near this level of performance for the workload of "make -j$(nproc)" in a Linux kernel tree. As a quick baseline, some numbers from the last time I tested this, with a c5.12xlarge (48 CPU) client and server: a defconfig local build was 40s, and a defconfig build with a Linux kernel tree in read-only NFS (with a tmpfs overlay on top for writability) was 6m55s. That's a 10x slowdown. System stats during the build showed 4-5MBps net recv and net send, and 1-4MBps disk write.

Is there some well-known method to getting reasonable performance out of off-the-shelf NFS servers and clients?

geertj wrote at 2021-11-30 20:47:34:

(PM-T for Amazon EFS, AWS's native NFSv4.1 file system)

Performance tuning NFS is difficult mostly because the information on how to do it isn't readily available. The two things most people run into:

- 'Close to open' cache consistency. In practical terms this means that open() is a round trip to the server to validate any data that might be cached already (unless you use delegations), write() goes into the page cache (as a writeback cache), and close() flushes all dirty data. Building a kernel tree involves reading and creating tons of small files, each of which requires two serial round trips over the network. Compare that to a local fs where neither open(O_CREAT), write() nor close() actually goes to disk and therefore runs at memory speed (unless you use things like O_DIRECT or fsync/fdatasync()).

- Per-TCP flow throughput limitations. On the AWS network the per-flow limit is 5 Gbit in general and 10 Gbit within a placement group. To work around this, people use the 'nconnect' mount option (which does not currently work with EFS). Local networks might have different limitations, but single TCP streams will typically always have some bandwidth limit lower than the physical network bandwidth. I believe that this (very cool!) fio plugin works around this by using multiple connections.

The actual data write latency of NFS servers isn't terribly different from local file systems.

Today, the best way to get the most performance out of NFS is to either use large files and/or keep files open, or use high concurrency. By default, the 4.1 client will issue up to 64 concurrent requests, which can be raised by increasing the 'max slots' NFS kernel module parameter. In your example of a kernel build, you could -j much higher than the number of CPUs because the compile jobs will be IO bound on reading input and writing output. This will amortize the round trips over more threads, and in theory (barring any other bottlenecks) reduce your build times.
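The amortization argument above can be sketched without touching NFS at all: if every open() costs one round trip of latency, issuing the opens from many workers divides the wall-clock cost by the concurrency. A minimal Python sketch, where a 5ms sleep stands in for one network round trip (none of this hits a real NFS mount):

```python
import time
from concurrent.futures import ThreadPoolExecutor

ROUND_TRIP = 0.005  # stand-in for one network round trip (5 ms)

def fake_open(path):
    # Simulate the close-to-open cache check: one serial round trip.
    time.sleep(ROUND_TRIP)
    return path

paths = [f"file{i}.o" for i in range(64)]

# Serial: 64 opens pay 64 round trips back to back.
t0 = time.monotonic()
for p in paths:
    fake_open(p)
serial = time.monotonic() - t0

# Concurrent: 16 workers overlap the round trips (~4 waves of 16).
t0 = time.monotonic()
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(fake_open, paths))
concurrent = time.monotonic() - t0

print(f"serial {serial:.2f}s, concurrent {concurrent:.2f}s")
```

The same arithmetic is why running make with more jobs than CPUs helps on NFS: the extra jobs exist to keep round trips in flight, not to burn CPU.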

JoshTriplett wrote at 2021-11-30 21:22:52:

Thank you _very_ much for the response!

> - 'Close to open' cache consistency. In practical terms this means that open() is a round trip to the server to validate any data that might be cached already (unless you use delegations), write() goes into the page cache (as a writeback cache), and close() flushes all dirty data. Building a kernel tree involves reading and creating tons of small files, each of which requires two serial round trips over the network.

That definitely sounds like a concern for writable NFS filesystems, but I was benchmarking reads to a read-only NFS mount.

Related: Is there some option I can pass to make it clear that the data _on the server_ will never change and thus no possible write-to-read or close-to-open consistency issues can arise?

> - Per-TCP flow throughput limitations. On the AWS network the per-flow limit is 5 Gbit in general and 10 Gbit within a placement group. To work around this, people use the 'nconnect' mount option (which does not currently work with EFS).

Interesting! I've never seen the per-flow limit mentioned before. Is that documented somewhere?

I'd be concerned about that if I were getting anywhere _close_ to that limit, but I was experiencing 4-5MBps network throughput. It seemed like individual file operations (like stat) were taking an excessive amount of time.

> In your example of a kernel build, you could -j much higher than the number of CPUs because the compile jobs will be IO bound on reading input and writing output.

I'm writing output to a local tmpfs (via overlayfs), not to NFS. And I'd love to tune the NFS setup to the point that reads (and stats) from NFS aren't causing a 10x slowdown.

geertj wrote at 2021-11-30 23:11:01:

> Related: Is there some option I can pass to make it clear that the data on the server will never change and thus no possible write-to-read or close-to-open consistency issues can arise?

As far as I know the NFS client does not support such a mount option today. I should have mentioned this, but there /is/ a way to eliminate the 'close to open' cache check for repeated open() operations, which is to use NFS delegations. NFS read delegations are supported by both nfsd and the NFS client. They are not perfect, as they are best effort, but can typically keep the core data set of your workload fully local. This would not work for your first build but would work for the second.

> Interesting! I've never seen the per-flow limit mentioned before. Is that documented somewhere?

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-inst...

chasil wrote at 2021-11-30 23:45:48:

Isn't NFS over RDMA an option to vastly increase performance?

I really only know the name, nothing on how to configure it.

https://developer.nvidia.com/blog/doubling-network-file-syst...

ninkendo wrote at 2021-11-30 15:33:13:

Any word on the iops numbers vs the in-kernel NFS client? The throughput is impressive, but IME it ends up being the stat/fd activity of NFS clients that's the limiting factor (try running `ls -l` in an NFS directory with lots of files in it; even worse if there are lots of symlinks involved.)

diamondlovesyou wrote at 2021-11-30 15:49:17:

Presumably, with async you could queue all the `stat` ops at once after the dir walk, leaving the total latency in the ballpark of "dir walk" + "ping for file stats" + "ping for remaining symlink stats, if present". But I don't think `ls` does this.

Otherwise, yeah, you incur network latency on every file, plus, as you say, symlink "pings" if those are present. So it's "dir walk" + "number of files"*"ping" + "symlinks"*"ping", which adds up.

Batching high-latency ops is one of the only cases where I like async.
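The batching idea can be sketched in a few lines of Python (plain os.lstat against a throwaway local directory here; on an NFS mount each lstat would be the round trip being overlapped):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def batched_listing(directory, workers=32):
    """List a directory, then issue all the stat ops concurrently,
    instead of one blocking lstat per entry as a naive `ls -l` would."""
    names = os.listdir(directory)  # the "dir walk"
    paths = [os.path.join(directory, n) for n in names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stats = list(pool.map(os.lstat, paths))  # batched "pings"
    return dict(zip(names, stats))

# Demo against a temporary local directory with 100 one-byte files.
with tempfile.TemporaryDirectory() as d:
    for i in range(100):
        with open(os.path.join(d, f"f{i}"), "w") as f:
            f.write("x")
    listing = batched_listing(d)
    print(len(listing))
```

Locally this buys nothing, but against a high-latency mount the total cost drops from "files × round trip" to roughly "files / workers × round trip".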

mprovost wrote at 2021-11-30 16:57:12:

NFS version 3 already has the READDIRPLUS operation which returns the directory contents and all of the stats together in one call to improve performance for this case. Sometimes the kernel client doesn't use READDIRPLUS and falls back to issuing a bunch of requests - usually because it's being super conservative about security (as if there is such a thing in NFS). It's also not that straightforward for a utility like ls to tell the kernel that it's reading both the directory and the stats for each file - they're separate system calls and the kernel has to figure out what the program is trying to do and optimise.

I wrote a tool that issues raw READDIRPLUS requests to list a directory:

https://rawgit.com/mprovost/NFStash/master/man/nfsls.8.html

tarasglek wrote at 2021-11-30 17:11:13:

Usually going via the kernel NFS client will use more memory bandwidth, so I would expect lower per-client numbers. From what I've read, you go from 3 memcopies in userspace to 4 with the kernel NFS client.

I haven't yet instrumented memory bandwidth on my amd machines, but it feels like I'm at the limit.

tarasglek wrote at 2021-11-30 17:16:26:

In this case my job was to generate a synthetic workload to drive max bandwidth. So I didn't have to worry about metadata.

Metadata is more complicated to model, need realistic directory structures etc.

https://www.spec.org/sfs2014/

(2020 version) is a good test of metadata-heavy workloads, but isn't open source like fio :(.

mprovost wrote at 2021-11-30 18:26:11:

The nice thing about NFS (v3) is that it's stateless, so you can just keep doing the same thing over and over. Messing around with metadata sucks but once you get the filehandle of the file that you want to read or write then you don't need it anymore.

I wrote a suite of tools that does all this and dumps the NFS transactions as JSON, maybe it can be useful:

https://github.com/mprovost/NFStash

guenthert wrote at 2021-11-30 19:15:04:

"This means that to establish multiple connections one must do something terrible like requiring NFS server to have multiple IPs"

Oh the horror.

guenthert wrote at 2021-11-30 18:51:23:

"To make use of multiple NICs I needed multiple NFS connections per NIC."

What? I presume "multiple NFS connections per host" was meant. Not sure what those NFS connections are supposed to be, though. NFSv3 is a (stateless) request/response protocol on top of TCP/IP connections. NFSv4 introduced sessions, but here NFSv3 was used, wasn't it?

gtirloni wrote at 2021-11-30 15:02:52:

TL;DR: Author integrated libnfs into the fio benchmark tool.

mayli wrote at 2021-11-30 22:14:21:

Thanks, though I came to this post after reading the full blog.