
Fast I/O Using io_uring in Rust

Written April 2022

As a part of my Computer Science Master's program, I built an async I/O runtime for Rust based on the new Linux kernel API io_uring. I dubbed this library faering. It is open source and the source code is linked below; however, the library has not been fully published yet as it is unfinished.

Faering Source Code

As a part of that project I had to write a report. I've decided to convert that report into gemtext and publish it here. I converted this from LaTeX by hand, so there may be errors. That's also why it's formatted the way it is, with numbered figures and listings and references at the end. If you are reading this in the future, this information is probably out of date and may no longer be representative of the current state of faering.


Abstract

io_uring is a new API in the Linux kernel that allows programs to perform asynchronous I/O faster than was previously possible, by using shared memory instead of syscalls. Rust is a relatively new systems programming language that guarantees memory safety and has good support for writing async code. Given how easily Rust lets programmers write fast async code, it seems natural to want to use io_uring from Rust.

This project created a high level memory safe library to do async I/O in Rust using the new io_uring API in the Linux kernel. This problem presented a design challenge. The commonly used async I/O interface in Rust provided by the futures-io library is incompatible with how io_uring manages memory. To solve this incompatibility, this project created two separate APIs that the user can choose from. One provided compatibility with futures-io by using a runtime-managed buffer pool. The other broke compatibility with futures-io but provided better performance. The project is as of now unfinished. All of the described functionality was implemented, but bugs remained that severely limited its performance.

Introduction

This project created a high level memory safe library to do async I/O in Rust using the new io_uring API in the Linux kernel. To accomplish this, the project explored how io_uring works and how to implement a Rust async runtime. The main design challenge addressed was solving the incompatibility between the standard async I/O interface defined in the futures-io crate and io_uring.

The majority of this project is addressed in the sections Background, Design Problem, and Implementation. The Background section provides information that frames the definition of the problem being solved by this project. It is assumed that the reader has a basic understanding of asynchronous I/O and the Rust programming language; however, four key areas are explained to prepare the reader with an understanding of these prerequisite concepts. The Design Problem section explains the problem that the solution design is intended to address. The Implementation section details the two different APIs provided and compares them, and the Experimental Results section explains the benchmark tests and their results. Possible future directions for this work are explored, and other similar efforts are described. Finally, we conclude and acknowledge those who provided insights and inspiration to this project.

Background

We live in an age where it is desirable to have server software that can serve many hundreds or even thousands of users at once. Traditionally, we have used blocking I/O to communicate with clients, where each I/O operation (like read, write, etc.) is handled one at a time, and the program thread is paused while the I/O operation is in progress. This approach has served us well for many years by using thread pools to handle many connections at once. But it has limits. Starting new threads is relatively slow, and each thread needs to keep a stack as well as other resources, even when idle. Having many threads running at once can eat up a substantial amount of memory. Modern applications can hit these limits, and so another approach is needed.
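As a concrete illustration of the traditional model (a sketch for this report, not code from the project), a blocking thread-per-connection server looks like this:

use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        // One OS thread per connection: each thread's stack and kernel
        // resources stay allocated even while read() blocks, idle.
        thread::spawn(move || {
            let mut buf = [0u8; 1024];
            if let Ok(n) = stream.read(&mut buf) {
                let _ = stream.write_all(&buf[..n]);
            }
        });
    }
    Ok(())
}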

Async I/O on Linux Before io_uring

Enter asynchronous (async) I/O. Async I/O allows us to scale applications even further by allowing one thread to perform many I/O operations at the same time. This is possible due to kernel support and good multi-tasking systems in various programming languages. The main async I/O interface in the Linux kernel is called epoll, which was added in 2002. It allows a thread to wait on any of several file descriptors to become ready for reading or writing, then perform a read or write. Use of epoll has increased as modern programming languages have made async programming easier. To take advantage of async I/O you must manage multiple concurrent tasks within a single thread. This is very difficult to do manually, but has become much easier as programming languages have added built-in support, often using either async/await syntax (Rust, NodeJS, Python) or some kind of lightweight thread mechanism (Golang)¹.

However, epoll has several limitations. A large one is that epoll cannot be used with regular files (files on the disk), though it does support network sockets, among other things. It is possible to do async I/O on regular files using the aio API, but that has other drawbacks that make it difficult to use effectively.
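For reference, the readiness pattern epoll implements looks roughly like this minimal sketch (written for this report against the libc crate; error handling omitted, and fd is assumed to be an open, non-blocking descriptor):

use std::os::unix::io::RawFd;

unsafe fn wait_until_readable(fd: RawFd) {
    // Create an epoll instance.
    let epfd = libc::epoll_create1(0);

    // Register interest in `fd` becoming readable.
    let mut ev = libc::epoll_event {
        events: libc::EPOLLIN as u32,
        u64: fd as u64,
    };
    libc::epoll_ctl(epfd, libc::EPOLL_CTL_ADD, fd, &mut ev);

    // Block until at least one registered descriptor is ready.
    let mut events = [libc::epoll_event { events: 0, u64: 0 }; 16];
    let n = libc::epoll_wait(epfd, events.as_mut_ptr(), 16, -1);

    // epoll only reports readiness; the caller still performs the
    // actual (now non-blocking) read afterwards.
    for e in &events[..n as usize] {
        let _ready_fd = e.u64 as RawFd;
    }
    libc::close(epfd);
}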

io_uring

io_uring is a new API in Linux for doing async I/O. It solves many of the problems of epoll and also performs better. io_uring allows a user-space process to create a pair of queues, implemented as ring buffers, in memory that is shared between the user process and the kernel. One is the submission queue, where the user process pushes I/O operations it wants the kernel to perform onto the queue. The kernel pops them off and performs them. The other is the completion queue, where the kernel pushes completion events (containing a return code) for I/O operations it has completed, making them available to the user process². Figure 1 depicts how I/O operation requests and responses are communicated to the kernel via two ring buffers.

Figure 1: How I/O operation requests and responses are communicated to the kernel via two ring buffers

io_uring tends to perform better than epoll because it uses shared memory instead of system calls for many of its operations. Since system calls are relatively slow on Linux, reducing the number of system calls needed to perform async I/O improves performance. io_uring also does not suffer from some of the limitations of epoll and can be used on regular files.
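For contrast with the epoll sketch above, a single read performed directly against the low-level io-uring crate (the unsafe crate faering builds on, described later) looks roughly like this; it is an illustrative sketch, not code from faering:

use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?;
    let file = std::fs::File::open("/etc/hostname")?;
    let mut buf = vec![0u8; 1024];

    // Build a read submission queue entry pointing at our buffer.
    let read_e = opcode::Read::new(
        types::Fd(file.as_raw_fd()),
        buf.as_mut_ptr(),
        buf.len() as u32,
    )
    .build()
    .user_data(0x42);

    // Push it onto the shared submission ring; no syscall yet.
    unsafe { ring.submission().push(&read_e).expect("queue full") };

    // One syscall tells the kernel to process entries and wait.
    ring.submit_and_wait(1)?;

    // The kernel pushes a completion entry carrying the result.
    let cqe = ring.completion().next().expect("missing completion");
    assert_eq!(cqe.user_data(), 0x42);
    println!("read {} bytes", cqe.result());
    Ok(())
}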

Async Rust

Modern Rust provides the async keyword which makes writing async code easy by allowing developers to write code that looks like regular threaded code, but is actually converted into futures. A future represents an asynchronous computation of a value that may not have finished computing yet⁴. In Rust, a future is a state machine implemented as a type that implements the Future⁴ trait. Code using the async keyword automatically generates these types, but they can also be written by hand. The Future trait has one method, poll, whose method signature is in Listing 1.

Listing 1

fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;

The poll method attempts to resolve the future into its final value. It will either return ready (Poll::Ready(T)) with the final value, or pending (Poll::Pending). If pending is returned, the future must be polled again later to try to get the value again; the future may need to be polled many times to complete. The poll method does not block if the value is not ready. Just polling the future in a loop (busy waiting) is inefficient, so we also have a Waker associated with each future. This Waker is passed into the poll method as a part of the parameter cx. The Waker allows the async runtime to notify whatever is polling the future when it's ready to be polled again, so that the poller can do other things while the future is processing. For futures that perform I/O, it is common for them to be polled once to start the I/O operation; the kernel then notifies the task via its waker when the I/O is complete, and the future is polled a second time to finish the operation.
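To make the poll/wake contract concrete, here is a minimal hand-written future (a sketch for this report, not faering code) that completes once a shared flag is set, with the waker fired from another thread:

use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

#[derive(Default)]
struct Shared {
    done: bool,
    waker: Option<Waker>,
}

struct FlagFuture(Arc<Mutex<Shared>>);

impl Future for FlagFuture {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut shared = self.0.lock().unwrap();
        if shared.done {
            Poll::Ready(())
        } else {
            // Not ready: store the waker so that whoever sets the
            // flag can tell the executor to poll us again.
            shared.waker = Some(cx.waker().clone());
            Poll::Pending
        }
    }
}

// Elsewhere (e.g. a driver thread), completing the future:
fn complete(shared: &Arc<Mutex<Shared>>) {
    let mut s = shared.lock().unwrap();
    s.done = true;
    if let Some(waker) = s.waker.take() {
        waker.wake(); // schedule the waiting task to be polled again
    }
}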

Async Rust also has the concept of tasks. A task is a future that is currently being run. A task is conceptually similar to a thread in that both are a logical sequence of operations currently being performed. However, they are different because a thread is managed by the operating system whereas a task is managed by the user process. Tasks also use less memory and are faster to start than threads. The code in the user process responsible for managing the currently running tasks is called the executor.

Rust Async Runtimes

In the Rust programming language, there is built-in support for writing async code. However, to actually use async I/O, we need an async runtime to bridge the gap between low-level system calls and high-level constructs like file and TCP socket objects. An async runtime often consists of many pieces. The two main ones are an I/O API, which provides high-level constructs like files and TCP sockets, and an executor, which runs asynchronous tasks, often across many threads. This project does not include an executor; however, the futures it creates can be run on any executor. While this project may be usable with another runtime like Tokio, a standalone executor like the ones provided by the futures library or smol is preferred, as those can be used without pulling in another I/O interface. In the context of this paper, when I refer to an async runtime, I am referring to the I/O interface portion.

To provide this I/O interface, the runtime must bridge the gap between a high level API and raw system calls. To do this, the runtime must track all of the state associated with each resource, which in the case of this project is mostly file descriptors and ongoing I/O operation handles. Since the raw system calls that the runtime is making are unsafe, a big responsibility of the runtime is to ensure that the API it provides to users is memory-safe. Depending on the APIs of the syscalls the runtime is using, creating a memory-safe API around them can be quite difficult. Additionally, each operating system has different async I/O APIs, so a runtime that seeks to be cross-platform needs to abstract over several of them. This project, however, does not attempt to do so, and only supports io_uring and therefore Linux.

Design Problem

The purpose of this project is to create a memory-safe and nice-to-use high-level API around io_uring for Rust, while maintaining the performance benefits of io_uring. There were challenges in doing so, the primary one being a fundamental incompatibility between the existing established async I/O interfaces and io_uring's memory model.

There are two common kinds of APIs for doing async I/O: readiness APIs and completion APIs. A readiness API informs the user process when the kernel is ready to perform I/O (such that it doesn't block), whereas a completion API informs the user process when I/O is complete. epoll is a readiness API and io_uring is a completion API.

The Rust standard library currently does not define async read and write traits equivalent to std::io::Read and std::io::Write. However, the futures-io¹⁰ crate does define AsyncRead and AsyncWrite traits that have become effectively standardized across most of the ecosystem. These traits assume a readiness API is being used behind the scenes to wait for and perform I/O. It is difficult, if not impossible, to make a completion API conform to these traits without violating memory safety or losing performance, due to the conflict between these APIs.

To see this conflict let's use AsyncRead as an example. The AsyncRead trait has one method that we care about, called poll_read. See the method signature in Listing 2.

Listing 2

fn poll_read(self: Pin<&mut Self>, cx: &mut Context<'_>, buf: &mut [u8]) -> Poll<Result<usize>>;

The important part here is that this method takes a reference to a buffer buf and returns a Poll type. If you call this method and it returns Poll::Pending, you are meant to call it again later when the waker is notified that the kernel is ready to perform the I/O operation. For a readiness API this design works: buf is only modified during the last call to poll_read, when the read operation is actually performed.
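For illustration, a readiness-based poll_read over a non-blocking source looks roughly like the following sketch (written for this report; register_waker stands in for whatever readiness mechanism, such as epoll, arranges the wake-up):

use std::io::{self, Read};
use std::task::{Context, Poll, Waker};

fn poll_read_readiness<R: Read>(
    source: &mut R,
    cx: &mut Context<'_>,
    buf: &mut [u8],
    register_waker: impl FnOnce(&Waker),
) -> Poll<io::Result<usize>> {
    match source.read(buf) {
        // The source was ready: buf is written exactly once, here.
        Ok(n) => Poll::Ready(Ok(n)),
        // Not ready: nothing has touched buf; park until readable.
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
            register_waker(cx.waker());
            Poll::Pending
        }
        Err(e) => Poll::Ready(Err(e)),
    }
}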

For completion-based APIs like io_uring this is a problem. When the user program initially submits its read request to the kernel, it gives the kernel a reference to the buffer it should read into. Then, at some point before the kernel notifies the user program that the I/O operation is complete, it will copy the data into that buffer. Logically, the kernel is holding a mutable reference to the buffer for the duration of the I/O operation.

So, herein lies the conflict. io_uring wants to hold a mutable reference to the buffer from the time the I/O operation is first submitted until it completes. But in AsyncRead, the mutable reference to buf only lives as long as one call to poll_read. To work properly with io_uring, buf would have to live as long as the future object that is implementing AsyncRead. However, even that would not be enough, because the future object could be dropped before the I/O operation completes, freeing buf while the kernel may still be modifying it.

This project provides two different solutions to solve this incompatibility, through two different APIs in the library. Users of the library can then choose which API they would like to use based on the trade-offs outlined below.

The first maintains compatibility with the futures-io traits and is depicted in Figure 2.

Figure 2: Compatible API requires an extra copy from the library's buffer to the user's buffer.

To avoid the problem outlined above, the runtime will pass pointers to buffers it manages to io_uring when submitting I/O operations. Then when the operations are completed, it will copy the data from its buffer into the user's buffer in the final call to poll_read or equivalent method. This guarantees memory safety because the runtime manages the buffer that the kernel is modifying, and therefore can guarantee it will remain valid memory for the entire I/O operation. The downside of this approach is that it introduces an extra memory copy which is not optimal for a performance-sensitive system.

The second solution breaks compatibility with the futures-io traits, but does not have the extra copy, and is depicted in Figure 3.

Figure 3: Copy-less API avoids the extra copy by passing ownership of the buffer to the library and back.

This solution introduces a new API where the futures that perform the I/O operations take ownership of the buffer they operate on when they're created instead of taking a reference to the buffer when they're polled. After the future is created, the ownership of the buffer is then given to the runtime so that if the future is dropped prematurely, the buffer will not be freed until the I/O operation is complete. As stated before, the main downside to this solution is that it breaks compatibility with the futures-io traits, and therefore can't integrate with other libraries that expect those traits to be used.

The copy-less API uses Listing 3 as its alternative to the poll_read method in Listing 2. It has two methods instead of one. start_read takes the context cx, an owned buffer buf, and optionally an offset to read from, offset. It starts a read operation, reading into buf, and returns the ID of the read operation that was started. Then poll_read_done is used to check whether the read operation with ID op_id is complete. This method returns the typical Poll type used by futures in Rust.

Listing 3

fn start_read(&self, cx: &mut Context<'_>, buf: Buf, offset: Option<usize>) -> Result<OpId>;

fn poll_read_done(&self, op_id: OpId) -> Poll<Result<(Buf, usize)>>;

To summarize the key differences between the copy-less and compatible APIs, Figure 4 depicts side by side how the two APIs handle ownership of the buffer differently over the course of a read operation.

Figure 4: Comparison of how the two APIs handle ownership of the buffer

Implementation

To achieve the goals of this project, the author has implemented a library called faering¹⁸. This section will explain in detail how the faering library functions.

Recall that the goal of this library is to provide a memory-safe abstraction over io_uring. Faering uses the io-uring Rust crate created by the Tokio project, which provides a lower-level Rust API to io_uring that is not memory-safe¹⁷. Faering builds on top of it to provide a high level memory-safe API.

The faering library is logically divided into three parts: the public I/O API, the reactor, and the driver. Each of these is described below. Following that are two implementation examples, one for the new copy-less API and the other for the compatible API.

The Three Parts of the faering Library

The public I/O API allows the user to perform I/O operations using common abstractions like File and TcpStream. These types are modeled after the Rust standard library and should seem familiar to anyone who has used Rust. To make these functions and methods async, many of them return hand-written futures that poll whether I/O operations are complete. The work of actually submitting I/O events to the kernel, and notifying tasks when the I/O they are waiting on is complete, is delegated to the other two components: the reactor and the driver.

In the faering library the reactor is responsible for submitting I/O requests to the kernel via io_uring and completing them once the kernel is done performing the I/O operation. The reactor also coordinates access to the io_uring instance and manages various resources like buffers used in requests.

The reactor is a single global instance of the Reactor type, whose definition is in Listing 4. uring contains the structures for interacting with io_uring. The submission and completion queues in this structure are also wrapped in mutexes so that they can be modified by different threads. op_id_alloc is responsible for issuing operation IDs. ops is a hash map that maps operations' IDs to their states. buf_pool manages a pool of fixed-size buffers used for the compatible API; existing buffers are reused rather than allocating a buffer per operation, avoiding unnecessary allocations. backlog is a list of operations that could not be submitted to io_uring because the submission queue was full; the reactor will submit these later when other operations complete. wakers is a hash map that maps operations' IDs to the wakers for their tasks. cleanup_list holds the IDs of I/O operations that the user dropped without completing them; the reactor will free the resources associated with them once the underlying I/O operations are complete.

Listing 4

struct Reactor {
    uring: Uring,
    op_id_alloc: OpIdAlloc,
    ops: CHashMap<OpId, OpState>,
    buf_pool: Mutex<BufPool>,
    backlog: Mutex<VecDeque<Entry>>,
    wakers: Mutex<HashMap<OpId, Waker>>,
    cleanup_list: Mutex<HashSet<OpId>>,
}

The reactor has two main functions: submitting I/O to io_uring, and reacting to I/O finishing. The submission end involves allocating the necessary resources for the given operation, constructing the submission queue entry to give to io_uring, then performing the submission. The completion end reacts to I/O operations finishing by waiting for the I/O operations to finish, collecting their return codes, then notifying the tasks waiting on those I/O operations that they can resume. The completion end also performs several cleanup tasks.

The driver is responsible for driving the completion end of the reactor. A thread must wait for I/O operations to complete and notify tasks when they do. The driver is responsible for doing this, whether in a background thread or as a part of the main thread in a single threaded program.

The way the driver drives the completion end of the reactor is by repeatedly calling the reactor's react method. In a multi-threaded program, the driver will simply start a background thread that will call react in an infinite loop. The driver background thread is automatically started the first time faering submits an I/O operation outside of a future being run with the block_on function. However, if the user's program only has a single thread, it may be undesirable for faering to start a background thread for the driver. In this case, the user can use faering's block_on function to run the driver as a part of the main thread.
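In the multi-threaded case, the driver background thread amounts to a loop like this sketch (consistent with the react calls shown in Listing 5 below):

// Sketch of the driver background thread: repeatedly let the
// reactor process completed I/O and wake the waiting tasks.
std::thread::spawn(|| loop {
    Reactor::get().react();
});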

Listing 5 contains the code for the block_on function. It takes a future as an argument, then repeatedly polls that future until it is complete. Before the loop, it creates the waker and the Context object that carries the waker into each poll of the future. The waker calls the reactor's notify method, which stops the reactor from waiting for I/O operations to complete. block_on's main loop polls the future, then calls the reactor's react method if the future returns incomplete. By calling react after each time poll returns incomplete, we process the I/O events submitted during the last poll of the future. Additionally, by having the waker call notify, we make sure that react will return when the future is ready to be polled again, even if no I/O events have completed. The BLOCK_ON_ON_THIS_THREAD flag set at the start of the function (and reset by the guard when it returns) prevents the driver background thread from starting while block_on is being used.

Listing 5

pub fn block_on<T>(future: impl Future<Output = T>) -> T {
    BLOCK_ON_ON_THIS_THREAD.with(|cell| cell.set(true));
    let _guard = CallOnDrop(|| {
        BLOCK_ON_ON_THIS_THREAD.with(|cell| cell.set(false));
    });
    let waker = waker_fn(move || Reactor::get().notify());
    let mut cx = Context::from_waker(&waker);
    pin!(future);

    loop {
        if let Poll::Ready(t) = future.as_mut().poll(&mut cx) {
            return t;
        }
        Reactor::get().react();
    }

}

This implementation of block_on has not been sufficiently tested for the author to be confident it works correctly in all cases. If it is used in a multi-threaded program, it is likely to misbehave. However, multi-threaded mode, where the driver runs as a background thread, does work correctly.

API Examples

To better illustrate how faering works internally, the implementations of both the copy-less API and the compatible API are described, through the example of reading a file. The copy-less API uses owned buffers to avoid the extra copy as described in the previous section. The compatible API uses a buffer pool which requires the extra copy.

Figure 5: Diagram of how a read operation is implemented.

Figure 5 shows a high-level overview of the logic followed in the examples. The flow begins with a call to File's read method, performed by the user as a part of their task. This method call causes the reactor to submit a read operation to the kernel via io_uring. At some point in the future, this operation completes and an entry is added to io_uring's completion queue. Then the completion end of the reactor, as driven by the driver, reacts to the I/O event completing and notifies the user's task that it can resume. The user's task then resumes at some point in the future and can move on in its logic.

Copy-less API Example

Listing 6 demonstrates reading from a file in faering. On line 1, we open a file at path. On line 2, we create a 1KB buffer to read data into. This Buf type is a special wrapper around a Vec provided by this library that allows slicing behavior on an owned buffer (a sketch of such a type appears after the note below). Then on line 3, we read from the file. This uses the copy-less API, which relies on owned buffers to avoid the extra copy described in the previous section. Next, we will dive deeper into how line 3 works.

Listing 6

let mut file = File::open(&path).await?;
let buf = Buf::new(1024);
let (buf, len) = file.read(buf).await?;

Note: In this and subsequent code examples, it is assumed that the code is inside an async function where applicable, and use statements may be omitted. Code may also have been modified for brevity and to better illustrate its core logic.
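The report does not show Buf's definition. As a rough illustration, an owned, sliceable buffer of this shape could look like the sketch below; the fields are guesses, and only the io_ptr_and_len method is known to exist (it appears later in Listing 12):

// Hypothetical sketch of an owned, sliceable buffer like faering's
// Buf. The fields and constructor are illustrative assumptions.
pub struct Buf {
    data: Vec<u8>,
    start: usize,
    end: usize,
}

impl Buf {
    pub fn new(len: usize) -> Self {
        Buf { data: vec![0; len], start: 0, end: len }
    }

    // Pointer and length of the active slice, suitable for handing
    // to the kernel for the duration of an I/O operation.
    pub fn io_ptr_and_len(&mut self) -> (*mut u8, u32) {
        let slice = &mut self.data[self.start..self.end];
        (slice.as_mut_ptr(), slice.len() as u32)
    }
}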

The read method on the File type is provided by the UringRead trait which File implements. The code for the read method is displayed in Listing 7. The method simply creates a ReadFuture which performs the actual reading.

Listing 7

fn read<'a>(&'a mut self, buf: Buf) -> ReadFuture<'a, Self>
where
    Self: Unpin,
{
    ReadFuture {
        reader: self,
        op_id: None,
        buf: Some(buf),
    }
}

Below in Listing 8 is the type declaration for ReadFuture. It has three fields: reader, which holds a mutable reference to the File object we are reading from; op_id, which will hold the ID of the io_uring operation for this read; and buf, which holds the buffer we will read into, although it will be taken out of the Option and handed off to the reactor once the I/O operation is started when poll is called for the first time.

Listing 8

struct ReadFuture<'a, R: Unpin + ?Sized> {
    reader: &'a mut R,
    op_id: Option<OpId>,
    buf: Option<Buf>,
}

Below in Listing 9 is the poll method from ReadFuture's implementation of the Future trait. This method polls to see if the read operation is complete. The first call to poll actually starts the read operation; subsequent calls check whether the read is complete. If buf is still a Some, the read operation is started by calling reader.start_read and the returned operation ID is stored in op_id. Then reader.poll_read_done is called to check if the I/O operation with ID op_id is complete.

Listing 9

fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_> ) -> Poll<Self::Output> {
    let Self { reader, op_id, buf } = &mut *self;
    if let Some(buf) = buf.take() {
        *op_id = Some(reader.start_read(cx, buf, None)?);
    }
    let ret = reader.poll_read_done(op_id.unwrap());
    if ret.is_ready() {
        *op_id = None;
    }
    ret
}

Next we look at the start_read method for File in Listing 10. This method takes the cx and buf from poll, as well as an offset, which is not used here. Other methods that call start_read can use offset to specify an offset to read from in the file. Other types like TcpStream that do not support reading from an offset ignore this parameter. The method calls the queue_op! macro to submit the read operation to the reactor. In addition to passing the parameters from start_read, it also specifies ReadOwnedOp as the first parameter: a type that specifies which operation type should be queued. ReadOwnedOp is a read operation for the copy-less API, so named because it owns the buffer.

Listing 10

fn start_read(&self, cx: &mut Context<'_>, buf: Buf, offset: Option<usize>) -> Result<OpId> {
    queue_op!(ReadOwnedOp, cx, self.fd, buf, offset.map(|i| i as i64))
}

Listing 11 shows the implementation of the queue_op! macro. It gets the global reactor instance, calls ReadOwnedOp::create, then calls Reactor's queue_op method. Each operation type (like ReadOwnedOp) has a create and a complete method, but these methods take different arguments depending on what is needed for that operation. The queue_op! macro calls the create method on the given operation type with the parameters supplied, then calls Reactor's queue_op method, which performs the operation-independent steps.

Listing 11

macro_rules! queue_op {
    ($op:ty, $cx:expr, $($op_args:tt)*) => {{
        use crate::reactor::Reactor;
        let reactor = Reactor::get();
        let (op, entry) = <$op>::create($($op_args)*)?;
        reactor.queue_op(Some($cx), op, entry)
    }}
}

The create methods of each operation type take different arguments, but they all return the same pair: an OpState, which holds the operation type for this operation and will hold the operation's return code when it's complete, and an Entry, which is the operation data structure actually submitted to io_uring.

Listing 12 creates a Read io_uring operation. That operation takes the file descriptor of the file we are reading from, as well as the address and length of the buffer we are reading into. It also sets the offset to -1 if no offset was provided in the arguments, which makes the read continue from where the last read left off.

Listing 12

pub fn create(
    fd: RawFd,
    mut buf: Buf,
    offset: Option<libc::off_t>,
) -> io::Result<(OpState, Entry)> {
    let (buf_ptr, buf_len) = buf.io_ptr_and_len();
    let entry = opcode::Read::new(types::Fd(fd), buf_ptr, buf_len)
        .offset(offset.unwrap_or(-1))
        .build();
    Ok((ReadOwnedOp { buf }.into(), entry))
}

In Listing 13, op is the OpState and entry is the Entry returned from Listing 12. This listing shows how the reactor saves the OpState to be retrieved when the operation is complete, and how it submits the Entry to io_uring. First, a unique operation ID for this operation is generated and the OpState is saved in the reactor under that ID. The Entry is then tagged with the ID via user_data so the completion can be matched back up later. Next, this operation's waker is saved under the same ID, so that the task waiting on this operation can be woken up when it's done. Finally, the Entry is pushed onto io_uring's submission queue and submitted.

Listing 13

let op_id = self.op_id_alloc.take_id();
self.ops.insert(op_id, op);
let entry = entry.user_data(op_id as _);
let mut wakers = self.wakers.lock();
wakers.insert(op_id, cx.waker().to_owned());
let mut sq = self.uring.sq.lock();
sq.sync();
unsafe {
    sq.push(&entry)
}
sq.sync();
self.uring.submitter.submit()?;

It is now the job of the driver to complete this operation when the I/O is finished. For the purposes of this example, we will assume the driver is running as a separate thread. It can also run in the main thread in single-threaded programs, as described in the section "The Three Parts of the faering Library", but that case is more complex, so we will not assume it here. In our case, the driver thread simply calls the reactor's react method in an infinite loop, so we look at the react method next.

Listing 14 shows the most important parts of the react method. It first checks whether io_uring's completion queue is empty. If it is, there are no operations for us to complete at the moment, so we wait until there is at least one. Once there is at least one entry in the completion queue, we remove each entry, get its ID, look up the accompanying OpState in the reactor, and save the entry's return code in the OpState. The OpState having its return code signals to other parts of the code that this operation is complete. Finally, we look up the waker for this operation in the reactor and call its wake method, waking up the task that is waiting on this I/O.

Listing 14

let mut cq = self.uring.cq.lock();
cq.sync();
if cq.is_empty() {
    self.uring.submitter.submit_and_wait(1)?;
}
let mut wakers = self.wakers.lock();
for entry in &mut cq {
    let op_id = entry.user_data() as OpId;
    let mut op = self.ops.get_mut(&op_id);
    op.res = Some(entry.result());
    ops_to_wake.push(op_id); // completed-op bookkeeping declared earlier in react (not shown)
    let waker = wakers.remove(&op_id).unwrap();
    let _ = catch_unwind(|| waker.wake());
}

Now that the task waiting on this I/O has been awakened, it will call the poll method on ReadFuture from Listing 9 again, except this time it will call File's poll_read_done method, shown in Listing 15. This method, similarly to start_read, calls a macro which completes the required operation type, in this case still ReadOwnedOp.

Listing 15

fn poll_read_done(&self, op_id: OpId) -> Poll<Result<(Buf, usize)>> {
    match complete_op!(ReadOwnedOp, op_id,)? {
        Some(ret) => Poll::Ready(Ok(ret)),
        None => Poll::Pending,
    }
}

Listing 16 shows the complete_op! macro that was called in Listing 15. If the operation ID does not yet have a return code set on its OpState, or it's not a valid ID, the macro returns None, indicating the operation is not complete. Otherwise, it retrieves the return code from the OpState, removing the OpState from the reactor. It then calls this operation type's complete method, which is responsible for cleaning up any resources allocated by the operation and returning its output; in this case, complete gives us the return code for the operation and the buffer we read into. The operation's ID is then returned to the allocator so it can be reused. Finally, the buffer and return code, which for this operation is the number of bytes read, are returned from the poll method on ReadFuture, giving them to the user and completing the read.

Listing 16

macro_rules! complete_op {
    ($op:ty, $op_id:expr, $($op_args:tt)*) => {{
        use crate::reactor::{OpState, Reactor, Op};
        use std::io::Error;
        let reactor = Reactor::get();
        if let Some(op_state) = reactor.ops.get(&$op_id) {
            if op_state.res.is_some() {
                drop(op_state);
                let OpState {res, op, ..} = reactor.ops.remove(&$op_id).unwrap();
                let res = res.unwrap();
                let is_timeout = matches!(op, Op::Timeout(_));
                #[allow(unused_mut)]
                let mut op = <$op>::unwrap_op(op);
                let ret = op.complete(res, $($op_args)*);
                reactor.op_id_alloc.return_id($op_id);
                tracing::trace!("Op {} completed with value {}", $op_id, res);
                if res >= 0 || (is_timeout && res == -libc::ETIME) {
                    ret.map(Some)
                } else {
                    Err(Error::from_raw_os_error(-res))
                }
            } else {
                Ok(None)
            }
        } else {
            Ok(None)
        }
    }};
}

Listing 17 shows the complete method for ReadOwnedOp. It simply consumes the operation, returning the buffer that was read into and the return code, converted into the proper type.

Listing 17

pub fn complete(self, res: i32) -> io::Result<(Buf, usize)> {
    let len = res as usize;
    Ok((self.buf, len))
}

Compatible API Example

The previous example illustrated how the copy-less API functions internally. The compatible API works very similarly internally, with a few differences outlined next. In Listing 18, we are again reading from a file, but this time we are passing a mutable reference to the read method instead of passing ownership of the buffer. This API comes from the AsyncRead trait in the futures-io crate and is much more similar to that of the Read trait in the standard library.

Listing 18

let mut file = File::open(&path).await?;
let mut buf = vec![0u8; 1024];
let len = file.read(&mut buf).await?;

The read method used in Listing 18 comes from the futures-io crate, so its source code is not detailed here; however, it is similar to the read method and the ReadFuture in the previous example. The future from the futures-io crate calls the poll_read method in Listing 19. This method serves a similar purpose to the start_read and poll_read_done methods in the previous example. It checks if there is already a read operation in progress on this file. If there is not, it will start one using the queue_op! macro like in the previous example. And if there is an operation in progress, it will poll whether it's complete using the complete_op! macro. The main difference is that it uses the ReadPoolOp operation instead of ReadOwnedOp.

Listing 19

fn poll_read(mut self: Pin<&mut Self>, cx: &mut Context<'_>, buf: &mut [u8]) -> Poll<Result<usize>> {
    match self.read_op {
        Some(op_id) => match complete_op!(ReadPoolOp, op_id, buf)? {
            Some(res) => {
                self.read_op = None;
                Poll::Ready(Ok(res))
            }
            None => Poll::Pending,
        },
        None => {
            let op_id = queue_op!(ReadPoolOp, cx, self.fd, buf.len(), None)?;
            self.read_op = Some(op_id);
            Poll::Pending
        }
    }
}

The next paragraphs will show the create and complete methods for ReadPoolOp. The rest of the process is identical to the previous example, so it will not be repeated.

Listing 20 shows the create method of ReadPoolOp. Lines 6-7 get a buffer from the buffer pool to use in this operation. Then lines 8-10 construct an io_uring read operation, the same as in Listing 12.

Listing 20

pub fn create(
    fd: RawFd,
    len: usize,
    offset: Option<libc::off_t>,
) -> io::Result<(OpState, Entry)> {
    let mut buf_pool = Reactor::get().buf_pool.lock();
    let (buf_idx, buf, buf_len) = buf_pool.get_raw_buf(len);
    let entry = opcode::Read::new(types::Fd(fd), buf, buf_len)
        .offset(offset.unwrap_or(-1))
        .build();
    Ok((ReadPoolOp { buf_idx }.into(), entry))
}

Listing 21 shows the complete method of ReadPoolOp. It retrieves the buffer that was read into from the buffer pool, then copies that buffer's contents into the user's buffer. Later code also marks this buffer in the buffer pool as available for reuse.

Listing 21

pub fn complete(&self, res: i32, buf: &mut [u8]) -> io::Result<usize> {
    let len = res as usize;
    let buf_pool = Reactor::get().buf_pool.lock();
    if res > 0 {
        let kbuf = buf_pool.buf_alloc.get(self.buf_idx).unwrap();
        buf[..len].copy_from_slice(&kbuf[..len]);
    }
    Ok(len)
}

Implementation Wrap Up

In this section the details of the faering implementation were explored. If the reader wishes to gain further insight into the inner workings of the library, they should refer to the library's API documentation and source code¹⁸. The next section will discuss the results and conclusions of running benchmark tests on the library.

Experimental Results

To assess the current performance of faering, two benchmarks were created to measure different aspects of the library's performance. Identical benchmarks were also created using the smol async runtime as a point of comparison¹⁴. For the first benchmark, an HTTP server serves a static web page from memory. This benchmark measures network I/O performance, and we used an HTTP benchmarking tool called wrk⁶.
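The exact wrk invocation is not recorded in this report; a typical run against such a server looks like the following, where -t is the thread count, -c the number of open connections, and -d the duration:

wrk -t4 -c100 -d30s http://127.0.0.1:8080/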

Listing 22 contains the HTTP server benchmark code for faering and Listing 23 contains the code for smol.

Listing 22

fn main() -> std::io::Result<()> {
    let ex = Executor::new();
    futures_lite::future::block_on(ex.run(async {
        let listener = TcpListener::bind("127.0.0.1:8080")?;
        let mut incoming = listener.incoming();
        while let Some(stream) = incoming.next().await {
            let mut stream = stream?;
            ex.spawn(async move {
                let buf = Buf::new(1024);
                let _ = stream.read(buf).await;
                let _ = stream
                    .write_all(RESP.as_bytes())
                    .await;
                let _ = stream.close().await;
            }).detach();
        }
        Ok(())
    }))
}

Listing 23

fn main() -> std::io::Result<()> {
    let ex = Executor::new();
    futures_lite::future::block_on(ex.run(async {
        let listener = TcpListener::bind("127.0.0.1:8080").await?;
        let mut incoming = listener.incoming();

        while let Some(stream) = incoming.next().await {
            let mut stream = stream?;
            ex.spawn(async move {
                let mut buf = vec![0u8; 1024];
                let _ = stream.read(&mut buf).await;
                let _ = stream.write_all(RESP.as_bytes()).await;
                let _ = stream.close().await;
            }).detach();
        }
        Ok(())
    }))
}

For the second benchmark, one gigabyte of data was copied from /dev/zero to /dev/null. This benchmark measures regular file I/O performance. Virtual files were used for these measurements to eliminate disk performance as a variable; adding benchmarks that use files on an actual disk would be valuable future work. The file copy test was measured using the Rust benchmarking framework criterion⁷.

Listing 24 contains the file copy benchmark code for faering and Listing 25 contains the code for smol.

Listing 24

fn file_copy_zero_to_null(c: &mut Criterion) {
    c.bench_function("fs_read", |b| {
        b.iter(|| {
            futures_lite::future::block_on(async {
                let mut file = File::open("/dev/zero").await.unwrap();
                let buf = Buf::new(1024 * 1024 * 1024);
                let data = file.read_exact(buf).await.unwrap();
                let mut file_other = File::create("/dev/null").await.unwrap();
                file_other.write_all(data).await.unwrap();
            });
        })
    });
}

criterion_group!(fs, file_copy_zero_to_null);
criterion_main!(fs);

Listing 25

fn fs_read(c: &mut Criterion) {
    c.bench_function("fs_read", |b| {
        b.iter(|| {
            futures_lite::future::block_on(async {
                let mut file = File::open("/dev/zero").await.unwrap();
                let mut buf = vec![0u8; 1024 * 1024 * 1024];
                file.read_exact(&mut buf).await.unwrap();
                let mut file = File::open("/dev/null").await.unwrap();
                file.write_all(&buf).await.unwrap();
            });
        })
    });
}
criterion_group!(fs, fs_read);
criterion_main!(fs);

As you can see from the benchmark results graph in Figure 6, faering's performance is currently quite bad. In the HTTP server benchmark, faering was able to service only 70 requests per second, whereas smol serviced 24K requests per second. In the file copy benchmark, faering took 259,000 microseconds (259 milliseconds) to copy the file, whereas smol took only 32 microseconds. io_uring should perform better than the mechanisms smol uses (epoll plus non-blocking I/O), so this result is surprising. faering was producing many "bad file descriptor" errors during the benchmarks. It is probable that whatever bug was causing these errors is at least partly responsible for faering's poor performance.

Figure 6: Bar chart of the relative performance between faering and smol on two benchmarks

The next section will discuss the shortcomings of the current implementation and potential ways to improve upon them.

Future Work

As shown in the experimental results, faering is not currently ready for real use. In the short term, the next steps are to fix the performance issues and whatever bugs are leading to them. While running the benchmarks, many of the write operations were returning "bad file descriptor" errors. These errors were only happening during benchmarking and not during testing, so it is likely that running many parallel operations is corrupting the state of the reactor or otherwise causing problems. It is possible that this is the primary cause of the performance issues, however other inefficiencies may also be causing problems.

Currently, the submission end of the reactor is guarded by a mutex, so only one I/O event can be submitted at a time. In situations with high parallelism, this may lead to lock contention, which could hurt performance. This could be remedied by adopting a lock-free approach, such as using a channel to submit operations, though doing so would likely require substantial refactoring.
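One possible shape for such a refactor is sketched below, under the assumption that a single dedicated thread owns the submission queue and receives prepared entries over a channel (names are illustrative, not faering code):

use std::sync::mpsc;
use io_uring::{squeue, IoUring};

// Sketch: one submitter thread owns the submission queue outright,
// so no mutex is needed; tasks send prepared entries over a channel.
fn spawn_submitter(mut ring: IoUring) -> mpsc::Sender<squeue::Entry> {
    let (tx, rx) = mpsc::channel::<squeue::Entry>();
    std::thread::spawn(move || {
        for entry in rx {
            unsafe {
                // Sole owner of the queue: push without locking.
                ring.submission().push(&entry).expect("queue full");
            }
            let _ = ring.submit();
        }
    });
    tx
}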

Before any performance optimizations are made, more benchmarks need to be written so that it can be determined whether performance actually improves from the changes made. Currently there are only a handful of benchmarks. More will need to be added to ensure that all aspects of the library are adequately measured.

Many of the limitations of faering are due to limitations in Rust itself. Logically speaking, it should be possible for a future that is polling an io_uring-based I/O operation to just hold a mutable reference to the buffer it's modifying instead of taking ownership of that buffer, because the future should live for the entire I/O operation. The problem is that the user can cancel the future before the I/O is complete by dropping it. If the user does that, the mutable reference to the buffer is also dropped, allowing the user to modify the buffer. But io_uring will not immediately stop when the future is cancelled, potentially leading to the user and the kernel mutating the buffer at the same time. This is a problem because in current Rust, all futures are cancelable. There has been discussion in Rust language development circles about changing this, but such a change would be fairly substantial and is not likely to emerge in the near future⁵.
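To make the hazard concrete, here is a sketch of what could go wrong with a reference-based design; read_into is a hypothetical function standing in for a completion-based read that merely borrows its buffer:

// Hypothetical: read_into is NOT a faering API; it represents a
// completion-based read that borrows the buffer instead of owning it.
let mut buf = vec![0u8; 1024];
let fut = read_into(&mut buf); // the kernel is handed buf's address
drop(fut);                     // future cancelled; the borrow ends...
buf[0] = 1;                    // ...but the kernel may still be writing
                               // into buf concurrently: a data race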

As a result of the current limitations of async Rust, faering currently does not support vectored I/O. The buffer pool method that the compatible API uses would destroy any performance benefit of vectored I/O and a solution similar to the copy-less API is also not possible because vectored I/O must be passed by reference, not by ownership. Support for non-cancelable futures would fix this problem.

Having no API in faering that is both performant and standardized is not optimal. There is also discussion in Rust development circles about creating a standardized async I/O API that would work for completion APIs. If such a standardized API emerges in the future, faering should adopt it⁵.

Prior Art

The other most promising attempt at creating an io_uring library for Rust to date is tokio-uring⁹. It implements a fully safe API and takes an approach similar to this project's copy-less API. While it seems to have implemented fewer features than faering as of release 0.3.0, the library is likely to progress, as tokio is a large and well-established project. A major difference between tokio-uring and faering is that tokio-uring has a built-in executor while faering relies on an external one.

To date, the only other previous attempt the author is aware of to create an entirely safe library around io_uring for Rust is ringbahn, which hasn't seen an update in over two years¹¹. The ringbahn author's blog posts about their difficulties implementing an io_uring interface for Rust were one of the main inspirations for faering. Other more recent attempts at creating an io_uring library for Rust, like iou and rio, do not guarantee memory safety if the user mutates a buffer while io_uring is operating on it¹²⁻¹³.

Of the existing async runtimes, faering has taken the most inspiration from smol¹⁴. The author studied its source code to help inform the design of faering. faering also adopts the more modular nature of smol by not including an executor: faering can be used with any executor, just as smol's async-io library can be used with any executor, such as smol's async-executor library. It may be useful for the author and any future contributors to also study the inner workings of the tokio and async-std libraries to gain more insights into ways to improve faering¹⁵⁻¹⁶.

Conclusion

This project explored both the new io_uring interface in Linux and the internals of how Rust async runtimes are implemented. faering attempted to solve the API incompatibility between futures-io and io_uring so as to provide the most useful library possible to its users.

While faering does provide two different ways to solve this API incompatibility, this is ultimately a compromise solution. Optimally, it would be possible to create one unified API for async I/O that would work for both readiness and completion APIs, without compromising ergonomics or performance. However, such a unified API currently does not exist and may not be possible in current Rust.

The current implementation of faering is incomplete but very near to a usable state. The author's future goal is to advance faering to be usable in a real production context. Even if it never finds wide adoption, hopefully it can at least help to inform the designs of future libraries.

References

1. Asynchronous Programming Under Linux

2. What is io_uring?

3. Welcome to Lord of the io_uring

4. The Future trait documentation

5. Async IO with completion-model IO systems

6. wrk http benchmarking tool

7. criterion Rust benchmarking framework

9. tokio-uring crate

10. futures-io crate

11. ringbahn crate

12. iou crate

13. rio crate

14. smol crate

15. async-std crate

16. tokio crate

17. io-uring crate

18. faering source code