File Descriptors, Open-File Tables, and Reference Counting

What This Concept Is

When a process calls open, the kernel returns a small non-negative integer: a file descriptor (fd). That integer is an index into a per-process table. Behind it sit two more tables, shared across processes. Three layers total:

per-process FD table      system-wide open-file table      in-memory inode table
+------+---------+        +--------+-------+-------+       +-------+-------+
| fd=3 |    -----+------->| offset | flags | cnt=2 |------>| inode | cnt=1 |
| fd=4 |   ...   |   +--->+--------+-------+-------+       +-------+-------+
+------+---------+   |    | offset | flags | cnt=1 |------>| inode | cnt=1 |
                     |    +--------+-------+-------+       +-------+-------+
another process's    |
FD table             |
+------+---------+   |
| fd=5 |    -----+---+
+------+---------+
  • The FD table is per-process. fd=3 in process A and fd=3 in process B are unrelated.
  • The open-file table is system-wide. Each entry holds the current byte offset, the access mode and status flags (O_RDONLY, O_APPEND, ...), and a reference count. Multiple FDs (same process via dup, different processes via fork) can point to one entry.
  • The in-memory inode table is also system-wide. Each entry caches the on-disk inode and has its own reference count.

Reference counts drive lifetimes. When the FD table entry is released (close), it decrements the open-file entry's count. When that hits zero, the open-file entry decrements the inode's count. When the inode's open count hits zero and the on-disk link count is zero, the file is truly deleted.
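
The lifetime rule can be observed from user space. The sketch below is only an illustration, not a reference implementation: the function name usable_after_unlink and the temp-file template are made up for this example. It creates a file, removes its only name, and then checks that the descriptor still reads and writes the data. The link count it reports is typically 0 on a local filesystem, though some filesystems behave differently.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>     /* mkstemp */
#include <string.h>     /* memcmp */
#include <sys/stat.h>   /* fstat */
#include <unistd.h>     /* unlink, write, pread, close */

/* Hypothetical helper: returns 1 if a file whose name has been removed is
 * still writable and readable through its open descriptor, 0 if not,
 * -1 on setup failure. Stores the link count seen after unlink() in
 * *links_out when the checks succeed. */
int usable_after_unlink(long *links_out)
{
    char path[] = "/tmp/fd_demo_XXXXXX";  /* illustrative template */
    int fd = mkstemp(path);               /* create + open read/write */
    if (fd < 0)
        return -1;
    if (unlink(path) != 0) {              /* remove the only directory entry */
        close(fd);
        return -1;
    }

    struct stat st;
    char buf[5] = {0};
    /* The inode is kept alive by our open-file entry, so I/O still works. */
    int ok = write(fd, "hello", 5) == 5
          && fstat(fd, &st) == 0
          && pread(fd, buf, sizeof buf, 0) == 5
          && memcmp(buf, "hello", 5) == 0;
    if (ok && links_out)
        *links_out = (long)st.st_nlink;

    close(fd);  /* last reference dropped: the kernel may now free the inode */
    return ok ? 1 : 0;
}
```

Calling it from a small main and printing both the return value and the link count shows the name gone while the data is still reachable through the descriptor.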

Why It Matters Here

This three-layer structure explains several otherwise-strange Unix behaviors:

  • fork duplicates the FD table, but parent and child share the open-file entry. Seeking in the parent moves the child's position too.
  • dup(3) returns a new descriptor (fd=4 if that is the lowest free slot) that shares the open-file entry with fd=3, so writes through either advance the same offset.
  • An unlinked open file disappears from its directory but lives until the last FD referring to it closes. This is how /tmp temp files work and why deleting a logfile while it is open does not free space.
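
The dup behavior in the list above can be checked in a few lines. This is a minimal sketch under stated assumptions: offset_seen_by_dup and the temp-file template are hypothetical names chosen for the example. It writes three bytes through one descriptor and then asks the duplicate where its offset is.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>   /* mkstemp */
#include <unistd.h>   /* dup, write, lseek, unlink, close */

/* Hypothetical helper: writes 3 bytes through one descriptor and returns
 * the offset observed through its dup()'d twin, or -1 on error. */
long offset_seen_by_dup(void)
{
    char path[] = "/tmp/fd_dup_XXXXXX";   /* illustrative template */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                          /* no name needed; fd keeps it alive */

    int twin = dup(fd);                    /* second FD slot, same open-file entry */
    if (twin < 0) {
        close(fd);
        return -1;
    }

    long off = -1;
    if (write(fd, "abc", 3) == 3)
        off = (long)lseek(twin, 0, SEEK_CUR);  /* query offset without moving it */

    close(fd);
    close(twin);
    return off;
}
```

The function returns 3: the twin never wrote anything, yet its offset moved, because the offset lives in the shared open-file entry and not in the FD slot.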

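Syscalls that take an fd (not a pathname) skip path resolution entirely: the kernel goes straight from the FD table to the open-file entry and its inode. That is part of why read/write are cheap per call.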

Concrete Example

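// needs <fcntl.h>, <unistd.h>, <sys/wait.h>
int fd = open("/tmp/a", O_WRONLY | O_CREAT | O_TRUNC, 0644);   // fd=3 -> OF -> inode
pid_t pid = fork();              // child gets its own fd=3, SAME OF entry
if (pid == 0) {
    write(fd, "y", 1);           // child: writes at the shared offset, advances it
    _exit(0);
}
write(fd, "x", 1);               // parent: writes at the shared offset, advances it
waitpid(pid, NULL, 0);

After both writes, the file contains both bytes ("xy" or "yx", depending on which process runs first) because the two processes share one offset: each write lands after the other instead of on top of it. Note the absence of O_APPEND: with it, every write goes to end-of-file regardless of the offset, so the example would not show offset sharing.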

Contrast with two separate open calls:

int a = open("/tmp/a", O_WRONLY);   // fd=3 -> OF1 -> inode
int b = open("/tmp/a", O_WRONLY); // fd=4 -> OF2 -> inode
write(a, "x", 1); // OF1.offset = 1
write(b, "y", 1); // OF2.offset = 1 (overwrites OF1's write!)

Two opens of the same file give independent offsets and independent flags. The second write clobbers the first because both start at offset 0.
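
A self-checking variant of the two-open case is sketched below. The helper name two_opens and the temp-file template are assumptions made for this example. It uses mkstemp for the first open so that it does not depend on /tmp/a existing.

```c
#define _XOPEN_SOURCE 700
#include <fcntl.h>      /* open, O_WRONLY */
#include <stdlib.h>     /* mkstemp */
#include <sys/stat.h>   /* fstat */
#include <unistd.h>     /* write, pread, unlink, close */

/* Hypothetical helper: opens the same file twice, writes "x" through the
 * first descriptor and "y" through the second, then returns the resulting
 * file size and stores byte 0 in *first. Returns -1 on error. */
long two_opens(char *first)
{
    char path[] = "/tmp/fd_two_XXXXXX";    /* illustrative template */
    int a = mkstemp(path);                  /* open-file entry 1, offset 0 */
    if (a < 0)
        return -1;
    int b = open(path, O_WRONLY);           /* open-file entry 2, offset 0 */
    unlink(path);                           /* name no longer needed */
    if (b < 0) {
        close(a);
        return -1;
    }

    struct stat st;
    long size = -1;
    if (write(a, "x", 1) == 1               /* entry 1 offset: 0 -> 1 */
        && write(b, "y", 1) == 1            /* entry 2 offset: 0 -> 1, same byte */
        && pread(a, first, 1, 0) == 1
        && fstat(a, &st) == 0)
        size = (long)st.st_size;

    close(a);
    close(b);
    return size;
}
```

The returned size is 1 and the stored byte is 'y': both writes targeted byte 0, so only the later one survives.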

Common Confusion / Misconception

"fd is the file." No. fd is an integer slot in a per-process table. The file is the inode; the in-between layer is the open-file entry.

"Closing a file releases it." Not if another FD (or another process via fork) still holds the open-file entry. Counts must reach zero.
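
A small sketch of that point, with the helper name write_after_twin_closed and the temp-file template invented for illustration: one of two descriptors sharing an open-file entry is closed, and the other keeps working.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>   /* mkstemp */
#include <unistd.h>   /* dup, write, unlink, close */

/* Hypothetical helper: returns 1 if a write through a dup()'d descriptor
 * still succeeds after the original descriptor has been closed,
 * 0 if it fails, -1 on setup error. */
int write_after_twin_closed(void)
{
    char path[] = "/tmp/fd_close_XXXXXX";  /* illustrative template */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    int twin = dup(fd);          /* open-file entry count: 1 -> 2 */
    if (twin < 0) {
        close(fd);
        return -1;
    }

    close(fd);                   /* count: 2 -> 1, entry is NOT released */
    int ok = write(twin, "ok", 2) == 2;

    close(twin);                 /* count: 1 -> 0, entry released here */
    return ok;
}
```

The same reasoning applies across fork: a child closing its inherited descriptor does not affect the parent's.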

"lseek changes the file." It does not. It changes the offset in the open-file entry, which is shared or independent depending on how the FDs were created.
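
One way to convince yourself, sketched with a hypothetical helper size_after_far_seek and an illustrative temp-file template: seek far past the end of a 3-byte file without writing, then ask the inode for its size.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>     /* mkstemp */
#include <sys/stat.h>   /* fstat */
#include <unistd.h>     /* write, lseek, unlink, close */

/* Hypothetical helper: writes 3 bytes, seeks to offset 1000 without
 * writing, and returns the file size reported by fstat(), or -1 on error. */
long size_after_far_seek(void)
{
    char path[] = "/tmp/fd_seek_XXXXXX";   /* illustrative template */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    struct stat st;
    long size = -1;
    if (write(fd, "abc", 3) == 3
        && lseek(fd, 1000, SEEK_SET) == 1000   /* only the OF entry's offset moves */
        && fstat(fd, &st) == 0)
        size = (long)st.st_size;               /* inode metadata is untouched */

    close(fd);
    return size;
}
```

The size stays 3. Only a subsequent write at the new offset would change the file itself.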

How To Use It

For any I/O scenario with multiple processes or threads, draw the three tables. Ask:

  1. Do these two FDs share an open-file entry (fork or dup), or point to independent ones (two open calls)?
  2. What is the link count on the inode? What is the open count?
  3. If I close one FD, what happens to the counts?

Answering these three questions resolves most confusion about shared file state, including apparent races on the offset.

Check Yourself

  1. Why can a process still write to a file whose directory entry was deleted five seconds ago?
  2. Why do two open calls on the same path not share a byte offset, but fork + inherited fd does?
  3. What does dup2(old, new) guarantee atomically? Why is it useful for shell redirection?

Mini Drill or Application

Write a short C or Python program:

  1. open /tmp/a for writing with O_TRUNC.
  2. fork.
  3. In the parent, write "A" 10 times; in the child, write "B" 10 times.
  4. Run it. Inspect the file. Explain what the byte sequence tells you about the shared offset.

Repeat but with parent and child each doing their own open. Explain the different output.

Read This Only If Stuck