Asynchronous I/O: aio_* and io_uring
What This Concept Is
Readiness-based I/O (epoll) tells you when to do an operation; you still issue it synchronously. Completion-based asynchronous I/O has the kernel perform the operation itself in the background and hand you the result when it is done. The application submits requests and later harvests completions.
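To make the readiness half of that contrast concrete, here is a minimal sketch using epoll on a single descriptor. The helper name read_when_ready is hypothetical and error handling is kept to a minimum; the point is only that the kernel reports readiness and the application still issues the read itself.

```c
#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable, then read from it.
 * Returns the number of bytes read, or -1 on any failure. */
static ssize_t read_when_ready(int fd, void *buf, size_t len)
{
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    ssize_t n = -1;

    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0) {
        struct epoll_event ready;
        /* Step 1: the kernel only tells us a read would not block. */
        if (epoll_wait(ep, &ready, 1, -1) == 1) {
            /* Step 2: the application performs the read synchronously. */
            n = read(fd, buf, len);
        }
    }
    close(ep);
    return n;
}
```

In a completion-based model the two steps collapse into one submission, and the result of the read arrives later as a completion.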
Two Linux APIs:
- POSIX aio_* (aio_read, aio_write, aio_suspend, ...): the original async API. On Linux it is implemented by glibc as a user-space thread pool and is rarely used in practice. Linux's native kernel AIO (io_submit, io_getevents) is a separate interface that is reliably asynchronous only with O_DIRECT and has many other limitations.
- io_uring (Linux 5.1+): a modern async I/O interface built around two shared-memory ring buffers between user space and kernel: a submission queue (SQ) and a completion queue (CQ). Applications push submission queue entries (SQEs) and pull completion queue entries (CQEs) by writing and reading shared memory; with SQ polling enabled, the fast path needs no syscalls at all.
   user space                      kernel
  +----------+                  +-----------+
  |   app    |                  | SQ poller |
  |          |     SQE ->       |           |
  |          |                  |   does    |
  |          |                  |  actual   |
  |          |     <- CQE       |   I/O     |
  +----------+                  +-----------+
  submission queue (ring)       completion queue (ring)
  <- head / tail shared ->      <- head / tail shared ->
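The shared head/tail idea can be illustrated with a toy single-producer, single-consumer ring. This is a sketch of the general technique only, not the real io_uring memory layout: the struct toy_ring type, its field names, and the 8-entry size are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1u)
static_assert((RING_ENTRIES & RING_MASK) == 0, "ring size must be 2^n");

/* Toy ring shared by one producer and one consumer. */
struct toy_ring {
    _Atomic uint32_t head;              /* next slot the consumer reads  */
    _Atomic uint32_t tail;              /* next slot the producer writes */
    uint64_t slots[RING_ENTRIES];       /* payload, e.g. a request cookie */
};

/* Producer side: returns false if the ring is full. */
static bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* Release: the slot write must be visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool toy_ring_pop(struct toy_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & RING_MASK];
    /* Release: the slot is free for reuse once head advances. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Each index has exactly one writer, so no lock is needed. In io_uring the application plays producer on the SQ and consumer on the CQ, with the kernel on the other side of each ring.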
Features that make io_uring win:
- No syscalls in fast path (SQ poll mode); submission is a memory write + fence.
- Batched submission: many operations per io_uring_enter.
- Covers most I/O-related syscalls: read, write, accept, connect, send, recv, fsync, close, openat, statx, ... all as SQE opcodes.
- Linked SQEs: chain operations so one completes before the next starts.
- Registered buffers and fixed files: skip per-operation buffer pinning and file refcounting on hot paths.
Why It Matters Here
Three trends made async I/O central:
- NVMe latency is too low for readiness-based models to fully exploit (you are done by the time epoll_wait wakes up).
- Syscall overhead (Meltdown/Spectre mitigations) makes per-op syscalls costly; batching amortizes them.
- High-concurrency servers want to keep the CPU in user space doing useful work, not ping-ponging across the kernel boundary.
io_uring is now a common choice for high-throughput I/O on Linux: RocksDB and ScyllaDB (through its Seastar framework), among others, offer io_uring backends. Kernel 5.10+ is effectively the floor.
Concrete Example
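Submitting a read with io_uring (liburing sketch; fd, buf, len, offset, and my_request_ctx are assumed to exist, and error handling is abbreviated):
#include <liburing.h>

struct io_uring ring;
int ret = io_uring_queue_init(64, &ring, 0);     // 64-entry SQ; returns -errno on failure
if (ret < 0) { /* handle -ret */ }
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
if (!sqe) { /* SQ full: submit pending SQEs and retry */ }
io_uring_prep_read(sqe, fd, buf, len, offset);   // IORING_OP_READ needs kernel 5.6+
sqe->user_data = (uintptr_t) my_request_ctx;     // echoed back in the matching CQE
io_uring_submit(&ring);                          // tell kernel "new SQEs"
struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe);            // block until one completion
if (ret < 0) { /* the wait itself failed: -errno */ }
else if (cqe->res < 0) { /* the read failed: -errno */ }
else { /* cqe->res = bytes read */ }
if (ret == 0) io_uring_cqe_seen(&ring, cqe);     // mark the CQE consumed
io_uring_queue_exit(&ring);                      // unmap the rings, close the ring fd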
With SQ polling and many in-flight SQEs, the io_uring_enter syscall disappears from the hot path; submission and completion become shared-memory ring manipulation. On modern NVMe this can push a single core past 1M IOPS.
Contrast with POSIX aio_*:
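#include <aio.h>
#include <errno.h>

struct aiocb cb = { .aio_fildes = fd, .aio_buf = buf, .aio_nbytes = len, .aio_offset = offset };
if (aio_read(&cb) == -1) { /* request was not queued; see errno */ }
while (aio_error(&cb) == EINPROGRESS) { /* do other work or aio_suspend */ }
ssize_t r = aio_return(&cb);   // bytes read, or -1 (aio_error gave the error code)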
Looks clean, but glibc implements it with a thread pool, and the native kernel AIO alternative (io_submit) needs O_DIRECT and has quirky semantics. Almost all modern Linux code prefers io_uring.
Common Confusion / Misconception
"epoll is asynchronous and io_uring is just faster." They are different models: epoll is readiness-based, io_uring is completion-based. With io_uring you are not told that the FD is ready; you submit the operation and are told its result.
"io_uring is always faster than epoll." For workloads dominated by many mostly idle connections, the two perform about the same (both are O(ready)). io_uring wins on high-IOPS, batched, or NVMe-saturating workloads, and when you want to chain operations (e.g., accept -> recv -> send).
"io_uring is a security hazard." Early versions had sandbox-escape issues that led some cloud providers to disable it. Modern kernels have improved safety. Check your distro's defaults; for general application code this is not a blocker.
How To Use It
Pick io_uring when:
- You need per-core throughput on NVMe or fast networks.
- You have naturally batched or pipelined operations.
- You want to avoid per-op syscall cost.
Stick with epoll + non-blocking I/O when:
- Your workload is dominated by idle connections and small I/O.
- You need portability to non-Linux systems (BSD, macOS: use kqueue).
- Your language runtime (Go, Java NIO, Python asyncio) already wraps epoll and you cannot drop down.
For most cases, the advice is: learn epoll first (concept 13); reach for io_uring when a measured bottleneck is in the syscall path or you need more parallelism than readiness allows.
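Because io_uring may be missing on old kernels or switched off by a sandbox or system policy, a program that wants an epoll fallback can probe for it at startup. The sketch below calls the raw io_uring_setup syscall rather than liburing so that it has no library dependency. The function name io_uring_probe_errno is hypothetical, and the code assumes kernel headers new enough to provide <linux/io_uring.h> and __NR_io_uring_setup.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if an io_uring instance can be created,
 * otherwise the positive errno reported by the kernel. */
static int io_uring_probe_errno(void)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof params);   /* the kernel rejects garbage fields */

    /* Ask for a tiny 4-entry ring; success returns a new file descriptor. */
    long fd = syscall(__NR_io_uring_setup, 4u, &params);
    if (fd < 0)
        return errno;   /* e.g. ENOSYS (unsupported) or EPERM (disabled) */

    close((int)fd);
    return 0;
}
```

A caller might select its io_uring backend when this returns 0 and fall back to epoll plus non-blocking I/O otherwise.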
Check Yourself
- Why is "completion-based" different from "readiness-based," and why does the difference matter under NVMe latency?
- What does SQ polling mode buy you, and what does it cost?
- Why can io_uring make close or openat go faster even though those are traditionally cheap?
Mini Drill or Application
Using liburing:
- Write a program that submits 1,000 read SQEs against a file in parallel.
- Harvest all completions and report total time.
- Compare with a synchronous read loop over the same offsets.
- Compare with a 16-thread parallel read implementation.
Measure CPU utilization and throughput for each.
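As a starting point for the synchronous baseline, one possible timing harness is sketched below. The function name sync_read_baseline, the fixed block size per read, and the sequential offsets are assumptions made for illustration; the io_uring and threaded variants are left as the exercise.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical baseline: read `count` blocks of `block` bytes at offsets
 * 0, block, 2*block, ... one after another. Returns total bytes read, or
 * -1 on error; elapsed wall-clock nanoseconds are stored in *elapsed_ns. */
static int64_t sync_read_baseline(int fd, char *buf, size_t block,
                                  size_t count, int64_t *elapsed_ns)
{
    struct timespec t0, t1;
    int64_t total = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < count; i++) {
        /* Each pread blocks until its data has been copied into buf. */
        ssize_t n = pread(fd, buf, block, (off_t)(i * block));
        if (n < 0)
            return -1;
        total += n;   /* short reads (e.g. at EOF) still count */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *elapsed_ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
                + (int64_t)(t1.tv_nsec - t0.tv_nsec);
    return total;
}
```

Reusing the same offsets and block size across all three variants keeps the comparison fair. Note that the page cache can dominate results unless the file is larger than memory or opened with O_DIRECT.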