File Descriptors, Open-File Tables, and Reference Counting
What This Concept Is
When a process calls open, the kernel returns a small non-negative integer: a file descriptor (fd). That integer is an index into a per-process table. Behind it sit two more tables, shared across processes. Three layers total:
per-process FD table          system-wide open-file table         in-memory inode table
+------+------+               +--------+-------+-------+          +-------+-------+
| fd=3 |  ----+----------+--->| offset | flags | cnt=2 |--------->| inode | cnt=1 |
| fd=4 |  ... |          |    +--------+-------+-------+          +-------+-------+
+------+------+          |    | offset | flags | cnt=1 |--------->| inode | cnt=1 |
                         |    +--------+-------+-------+          +-------+-------+
another process's table  |
+------+------+          |
| fd=5 |  ----+----------+
+------+------+
- The FD table is per-process. fd=3 in process A and fd=3 in process B are unrelated.
- The open-file table is system-wide. Each entry holds the current byte offset, the open mode (O_RDONLY, O_APPEND, ...), and a reference count. Multiple FDs (same process via dup, different processes via fork) can point to one entry.
- The in-memory inode table is also system-wide. Each entry caches the on-disk inode and has its own reference count.
Reference counts drive lifetimes. When an FD table entry is released (close), the kernel decrements the count on the open-file entry it pointed to. When that count hits zero, the open-file entry is freed and the inode's open count is decremented. When the inode's open count hits zero and the on-disk link count is also zero, the file is truly deleted and its storage is reclaimed.
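The lifetime rule can be observed from user space. The following is a minimal sketch, assuming a POSIX system and a local filesystem (network filesystems may report link counts differently); the helper name link_count_after_unlink and the scratch path passed to it are made up for illustration. It unlinks a file that is still open, then checks that the descriptor keeps working even though the link count is zero.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from the text above): creates `path`, unlinks it
 * while it is still open, then writes and reads back through the descriptor.
 * Returns the link count fstat reported after the unlink, or -1 on error. */
long link_count_after_unlink(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (unlink(path) != 0) {        /* remove the only directory entry */
        close(fd);
        return -1;
    }

    struct stat st;
    char buf[4];
    long nlink = -1;

    /* The inode is still alive: our open-file entry holds a reference to it. */
    if (fstat(fd, &st) == 0 &&
        write(fd, "data", 4) == 4 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, 4) == 4 &&
        memcmp(buf, "data", 4) == 0)
        nlink = (long)st.st_nlink;

    close(fd);                      /* last reference dropped: storage can be reclaimed */
    return nlink;
}
```

Calling link_count_after_unlink("/tmp/scratch") from a small main should return 0 on a typical local filesystem: the link count is zero, yet the write and read both succeed until the final close.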
Why It Matters Here
This three-layer structure explains several otherwise-strange Unix behaviors:
- fork duplicates the FD table, but parent and child share the open-file entry. Seeking in the parent moves the child's position too.
- dup(3) creates fd=4 that shares the open-file entry with fd=3, so writes through either advance the same offset.
- An unlinked open file disappears from its directory but lives until the last FD referring to it closes. This is how /tmp temp files work and why deleting a logfile while it is open does not free space.
Syscalls that take fd (not pathname) bypass path resolution entirely; that is why read/write are cheap.
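The dup behavior can be checked in a single process. This is a rough sketch, assuming a POSIX system; other_fd_offset_after_write is a hypothetical helper name, and the scratch path is whatever the caller supplies. It writes three bytes through one descriptor and then asks what offset a second descriptor sees, where the second descriptor is either a dup of the first or the result of a separate open.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes 3 bytes through one descriptor and returns the
 * offset observed through a second one, or -1 on error.
 * use_dup != 0: the second descriptor is dup(fd), sharing the open-file entry.
 * use_dup == 0: the second descriptor comes from a separate open(). */
long other_fd_offset_after_write(const char *path, int use_dup)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* Either alias the same open-file entry, or create an independent one. */
    int other = use_dup ? dup(fd) : open(path, O_WRONLY);
    long seen = -1;

    if (other >= 0 && write(fd, "abc", 3) == 3)
        seen = (long)lseek(other, 0, SEEK_CUR);  /* query the offset without moving it */

    if (other >= 0)
        close(other);
    close(fd);
    unlink(path);                                /* tidy up the scratch file */
    return seen;
}
```

With use_dup set, the result should be 3, because the write through fd advanced the one shared offset. With use_dup cleared, the result should be 0, because the second open created its own open-file entry with its own offset.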
Concrete Example
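int fd = open("/tmp/a", O_WRONLY | O_TRUNC); // fd=3 -> OF -> inode (no O_APPEND: it would mask the shared offset)
pid_t pid = fork();
// parent and child now both have fd=3, pointing at the SAME OF entry
if (pid == 0) {
    write(fd, "y", 1); // child: advances the shared offset
    _exit(0);
}
write(fd, "x", 1); // parent: advances the same shared offset
After both writes, the file contains "x" and "y" in some order ("xy" or "yx", depending on scheduling). Neither byte overwrites the other, because both processes share one offset: whichever write runs first advances it for the other.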
Contrast with two separate open calls:
int a = open("/tmp/a", O_WRONLY); // fd=3 -> OF1 -> inode
int b = open("/tmp/a", O_WRONLY); // fd=4 -> OF2 -> inode
write(a, "x", 1); // OF1.offset = 1
write(b, "y", 1); // OF2.offset = 1 (overwrites OF1's write!)
Two opens of the same file give independent offsets and independent flags. The second write clobbers the first because both start at offset 0.
Common Confusion / Misconception
"fd is the file." No. fd is an integer slot in a per-process table. The file is the inode; the in-between layer is the open-file entry.
"Closing a file releases it." Not if another FD (or another process via fork) still holds the open-file entry. Counts must reach zero.
"lseek changes the file." It does not. It changes the offset in the open-file entry, which is shared or independent depending on how the FDs were created.
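The second misconception can be tested directly. Below is a small sketch under the assumption of a single-threaded POSIX process (in a multithreaded program a closed descriptor number could be reused by another thread); open_file_entry_survives_close is a hypothetical helper name. It closes one of two descriptors that share an open-file entry and checks that the other still works.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if, after closing one of two descriptors
 * that share an open-file entry, the closed one is rejected with EBADF and
 * the remaining one can still write. Returns 0 otherwise, -1 on setup error. */
int open_file_entry_survives_close(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int copy = dup(fd);             /* open-file entry count: 2 */
    if (copy < 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    close(fd);                      /* count drops to 1; the entry stays alive */

    int closed_fd_rejected = (write(fd, "x", 1) == -1 && errno == EBADF);
    int copy_still_works = (write(copy, "x", 1) == 1);

    close(copy);                    /* count reaches 0; the entry is released */
    unlink(path);                   /* tidy up the scratch file */
    return closed_fd_rejected && copy_still_works;
}
```

The expected result is 1: close only released one slot in the FD table, while the open-file entry behind it stayed in use through the duplicate.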
How To Use It
For any I/O scenario with multiple processes or threads, draw the three tables. Ask:
- Do these two FDs share an open-file entry (fork or dup), or point to independent ones (two open calls)?
- What is the link count on the inode? What is the open count?
- If I close one FD, what happens to the counts?
This rules out most race-condition confusions around shared file state.
Check Yourself
- Why can a process still write to a file whose directory entry was deleted five seconds ago?
- Why do two open calls on the same path not share a byte offset, but fork + inherited fd does?
- What does dup2(old, new) guarantee atomically? Why is it useful for shell redirection?
Mini Drill or Application
Write a short C or Python program:
- open /tmp/a for writing with O_TRUNC.
- fork.
- In both parent and child, write "A" 10 times and "B" 10 times respectively.
- Run it. Inspect the file. Explain what the byte sequence tells you about the shared offset.
Repeat but with parent and child each doing their own open. Explain the different output.