Asynchronous I/O: `aio_*` and `io_uring`

What This Concept Is

Readiness-based I/O (epoll) tells you when to do an operation; you still issue it synchronously. Completion-based asynchronous I/O has the kernel perform the operation itself in the background and hand you the result when it is done. The application submits requests and later harvests completions.

Two Linux APIs:

POSIX aio_* (aio_read, aio_write, aio_suspend, ...): the original async API. On Linux, glibc implements it with a user-space thread pool, and it is rarely used in practice. Linux's native kernel AIO (io_submit, io_getevents) is a separate interface; it is reliably asynchronous only for files opened with O_DIRECT and has many other limitations.
io_uring (Linux 5.1+): a modern async I/O interface built around two shared-memory ring buffers between user space and kernel: a submission queue (SQ) and a completion queue (CQ). Applications push submission queue entries (SQEs) and pull completion queue entries (CQEs) through shared memory. With SQ polling enabled the fast path needs no syscalls at all; without it, a single io_uring_enter call can submit a whole batch.

 user space             kernel
 +----------+           +-----------+
 | app      |           | SQ poller |
 |          |  SQE ->   |           |
 |          |           |  does     |
 |          |           |  actual   |
 |          |  <- CQE   |  I/O      |
 +----------+           +-----------+

       submission queue (ring)   completion queue (ring)
       <- head / tail shared ->  <- head / tail shared ->
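
To make the head/tail mechanics in the diagram concrete, here is a toy single-producer, single-consumer ring in plain C11. It is only a model: the names toy_ring, toy_ring_push, and toy_ring_pop are invented for illustration, and the real io_uring rings are mmap'ed structures with a different layout that liburing manages for you. What carries over is the protocol: the producer fills a slot and then publishes it by advancing the tail, and the consumer reads the slot and then releases it by advancing the head.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)
_Static_assert((RING_ENTRIES & RING_MASK) == 0, "ring size must be a power of two");

/* Toy model of one ring: indices grow without bound and are masked on use. */
struct toy_ring {
    _Atomic uint32_t head;               /* advanced only by the consumer */
    _Atomic uint32_t tail;               /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: fill the slot first, then publish it with a release store. */
bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;                    /* ring is full */
    r->slots[tail & RING_MASK] = value;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: read the slot first, then hand it back by advancing head. */
bool toy_ring_pop(struct toy_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                    /* ring is empty */
    *out = r->slots[head & RING_MASK];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

In io_uring terms, the application is the producer on the SQ and the consumer on the CQ, and the kernel plays the opposite role on each. The release store on the tail corresponds to the "memory write + fence" style of submission.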

Features that make io_uring win:

No syscalls in fast path (SQ poll mode); submission is a memory write + fence.
Batched submission: many operations per io_uring_enter.
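Supports a large and growing set of operations: read, write, accept, connect, send, recv, fsync, close, openat, statx, ... all as SQE opcodes. Each opcode arrived in a specific kernel release, so availability depends on the kernel version.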
Linked SQEs: chain operations so one completes before the next starts.
Registered buffers and fixed files: pre-registering buffers and file descriptors lets the kernel skip per-operation page pinning and file reference counting on hot paths.

Why It Matters Here

Three trends made async I/O central:

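NVMe latency is low enough that the wakeup and syscall overhead of a readiness loop becomes a large share of each operation (the I/O is done by the time epoll_wait wakes up). Readiness also does not apply to regular files at all: epoll_ctl rejects them with EPERM.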
Syscall overhead (Meltdown/Spectre mitigations) makes per-op syscalls costly; batching amortizes them.
High-concurrency servers want to keep the CPU in user space doing useful work, not ping-ponging across the kernel boundary.

io_uring is now a mainstream choice for high-throughput I/O on Linux: RocksDB and ScyllaDB (through its Seastar framework), among others, can use it. Kernel 5.10+ is effectively the floor.

Concrete Example

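Submitting a read with io_uring (liburing; fd, buf, len, offset, and my_request_ctx are assumed to be set up by the caller, and error handling is abbreviated):

#include <liburing.h>

struct io_uring ring;
int ret = io_uring_queue_init(64, &ring, 0);          // 64-entry SQ; returns -errno on failure
if (ret < 0) { /* io_uring unavailable or out of resources */ }

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);   // NULL if the SQ is full
io_uring_prep_read(sqe, fd, buf, len, offset);        // read opcode needs Linux 5.6+
io_uring_sqe_set_data(sqe, my_request_ctx);           // handed back in cqe->user_data

io_uring_submit(&ring);   // tell kernel "new SQEs" (one io_uring_enter call)

struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe);     // block until one completion
if (ret == 0) {
    if (cqe->res < 0) { /* the read failed: cqe->res is -errno */ }
    else              { /* cqe->res = bytes read, possibly fewer than len */ }
    io_uring_cqe_seen(&ring, cqe);        // mark the CQE as consumed
}
io_uring_queue_exit(&ring);               // tear down the ring when finished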

With SQ polling and many in-flight SQEs, the io_uring_enter syscall disappears from the hot path; submission and completion become shared-memory ring manipulation. On modern NVMe this can push a single core past 1M IOPS.

Contrast with POSIX aio_*:

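struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf, .aio_nbytes = len, .aio_offset = offset };
if (aio_read(&cb) != 0) { /* request was not queued; errno says why */ }
while (aio_error(&cb) == EINPROGRESS) { /* do other work or aio_suspend */ }
ssize_t r = aio_return(&cb);   // call exactly once per request; -1 on failure

Looks clean. Under glibc, though, the request simply runs on a thread pool. The kernel-native alternative (io_submit) is a different API that needs O_DIRECT to be reliably asynchronous and has quirky semantics. Almost all modern code prefers io_uring.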
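
For comparison, here is a self-contained version of the POSIX pattern that sleeps in aio_suspend instead of spinning. The helper name aio_read_file is invented for this sketch. It uses only standard aio.h calls; older glibc versions need -lrt at link time.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to `len` bytes from `path` at `offset` with
 * POSIX AIO. Returns bytes read, or a negative errno value on failure. */
ssize_t aio_read_file(const char *path, void *buf, size_t len, off_t offset)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -errno;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_read(&cb) != 0) {            /* request could not be queued */
        int e = errno;
        close(fd);
        return -e;
    }

    /* Sleep until the request leaves EINPROGRESS; aio_suspend may return
     * early if a signal arrives, so re-check in a loop. */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    int err = aio_error(&cb);            /* 0 on success, else an errno value */
    ssize_t n = aio_return(&cb);         /* collect the result exactly once */
    close(fd);
    return err != 0 ? -err : n;
}
```

A real caller would do other work between aio_read and the wait; blocking immediately, as this sketch does, only shows the call sequence.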

Common Confusion / Misconception

"epoll is asynchronous and io_uring is just faster." They are different models: epoll is readiness-based, io_uring is completion-based. With io_uring you are not told that the FD is ready; you submit the operation and are told its result.

"io_uring is always faster than epoll." For workloads dominated by many mostly idle connections, the two perform about the same (both are O(ready)). io_uring wins on high-IOPS, batched, or NVMe-saturating workloads. It also wins when you want to chain operations (e.g., accept -> recv -> send).

"io_uring is a security hazard." Early versions had sandbox-escape issues that led some cloud providers to disable it. Modern kernels have improved safety. Check the defaults of your distro and container runtime, since some sandbox profiles block the io_uring syscalls; for general application code this is not a blocker.
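
The readiness-versus-completion distinction from the first misconception is easiest to see in code. This sketch shows the readiness side: epoll only reports that the descriptor is readable, and the application still has to issue read itself. The helper name read_when_ready and its -ETIMEDOUT return convention are assumptions made for this example.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait up to `timeout_ms` for `fd` to become readable,
 * then read from it. Returns bytes read, -ETIMEDOUT if nothing became ready,
 * or another negative errno value on failure. */
ssize_t read_when_ready(int fd, void *buf, size_t len, int timeout_ms)
{
    int ep = epoll_create1(0);
    if (ep < 0)
        return -errno;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
        int e = errno;
        close(ep);
        return -e;
    }

    /* Step 1: the kernel only says "fd is readable"; no data has moved yet. */
    struct epoll_event ready;
    int n = epoll_wait(ep, &ready, 1, timeout_ms);

    ssize_t result;
    if (n < 0) {
        result = -errno;
    } else if (n == 0) {
        result = -ETIMEDOUT;
    } else {
        /* Step 2: the application performs the read itself, with a second syscall. */
        result = read(fd, buf, len);
        if (result < 0)
            result = -errno;
    }
    close(ep);
    return result;
}
```

With io_uring, steps 1 and 2 collapse into one submitted read whose CQE already carries the byte count.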

How To Use It

Pick io_uring when:

You need per-core throughput on NVMe or fast networks.
You have naturally batched or pipelined operations.
You want to avoid per-op syscall cost.

Stick with epoll + non-blocking I/O when:

Your workload is dominated by idle connections and small I/O.
You need portability to non-Linux systems: the readiness model maps directly onto kqueue on BSD and macOS, whereas io_uring is Linux-only.
Your language runtime (Go, Java NIO, Python asyncio) already wraps epoll and you cannot drop down.

For most cases, the advice is: learn epoll first (concept 13); reach for io_uring when a measured bottleneck is in the syscall path or you need more parallelism than readiness allows.

Check Yourself

Why is "completion-based" different from "readiness-based," and why does the difference matter under NVMe latency?
What does SQ polling mode buy you, and what does it cost?
Why can io_uring make close or openat go faster even though those are traditionally cheap?

Mini Drill or Application

Using liburing:

Write a program that submits 1,000 read SQEs against a file in parallel.
Harvest all completions and report total time.
Compare with a synchronous read loop over the same offsets.
Compare with a 16-thread parallel read implementation.

Measure CPU utilization and throughput for each.
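
As a possible starting point for the synchronous baseline, here is a sketch of a timed pread loop. The function name sync_read_seconds, its parameters, and the choice of consecutive block-sized offsets are all assumptions; adapt the offsets to match whatever your io_uring version reads.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical baseline: read `count` blocks of `block` bytes at offsets
 * 0, block, 2*block, ... one pread at a time. Stores the byte total in
 * *total and returns elapsed seconds, or -1.0 on any failure. */
double sync_read_seconds(const char *path, size_t block, unsigned count, long long *total)
{
    char *buf = malloc(block);
    int fd = open(path, O_RDONLY);
    if (buf == NULL || fd < 0) {
        free(buf);
        if (fd >= 0)
            close(fd);
        return -1.0;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    *total = 0;
    for (unsigned i = 0; i < count; i++) {
        /* Each iteration is one blocking syscall: nothing overlaps. */
        ssize_t n = pread(fd, buf, block, (off_t)i * (off_t)block);
        if (n < 0) {
            free(buf);
            close(fd);
            return -1.0;
        }
        *total += n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(buf);
    close(fd);
    return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Keep in mind that repeated runs over the same file are likely to be served from the page cache, so use a file larger than memory or O_DIRECT if you want to measure the device rather than the cache.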
