File I/O and mmap Workshop

Retrieval Prompts

  1. Write the five raw I/O syscalls and their signatures from memory.
  2. Explain, in one sentence each, what O_APPEND, O_CLOEXEC, O_NONBLOCK, and O_TRUNC do.
  3. State two reasons read might return fewer bytes than requested.
  4. Write the mmap call you would use to read-map a file, and say what each of its six arguments does.
  5. State the difference between MAP_SHARED and MAP_PRIVATE, both for reads and for writes.
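
As a reference point for prompt 4, here is a minimal sketch of a read-only file mapping. The helper name map_file_readonly is hypothetical, and treating an empty file as a failure is an assumption made here because mmap rejects a zero length.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only.
 * Returns the mapping and stores its length in *len_out.
 * Returns NULL on any failure, including an empty file. */
const void *map_file_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL,               /* addr: let the kernel pick the address */
                   (size_t)st.st_size, /* length: number of bytes to map */
                   PROT_READ,          /* prot: pages are readable only */
                   MAP_PRIVATE,        /* flags: private mapping */
                   fd,                 /* fd: the file backing the mapping */
                   0);                 /* offset: start at the beginning of the file */
    close(fd);                         /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

A caller would pass the returned pointer and length to munmap when done.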

Compare and Distinguish

  • read/write vs fread/fwrite
  • mmap of a file vs mmap with MAP_ANONYMOUS
  • lseek(fd, 0, SEEK_END) vs fstat(fd, &st).st_size
  • msync vs fsync
  • ftruncate to grow a file vs lseek + write 1 byte to grow it

Common Mistake Check

  1. Writing a cat that assumes read(fd, buf, 4096) always returns 4096 until EOF.
  2. Calling lseek on a pipe and treating the ESPIPE error as "the file is weird."
  3. Calling open(path, O_CREAT | O_WRONLY) without a mode_t, getting random-garbage permissions.
  4. mmap-ing a 10 GB file and assuming RSS will be 10 GB.
  5. Modifying a MAP_PRIVATE | PROT_WRITE mapping and expecting the file on disk to change.
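
For mistake 3, a short sketch of the corrected call may help. The helper name create_for_write, the O_TRUNC flag, and the 0644 mode are illustrative choices, not requirements of the workshop. The process umask can still clear bits from the requested mode.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: create (or truncate) a file for writing.
 * Because O_CREAT is set, open reads a third argument as the mode.
 * Omitting it leaves the permissions to whatever value happens to
 * be where that argument would have been. */
int create_for_write(const char *path)
{
    return open(path, O_CREAT | O_WRONLY | O_TRUNC, (mode_t)0644);
}
```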

Mini Application: Implement cat

Requirements:

  1. Use only open, read, write, close.
  2. Loop both read and write to handle short transfers.
  3. Handle EINTR by retrying.
  4. With no arguments, copy stdin. With one or more paths, concatenate them to stdout.
  5. Byte-for-byte match against system cat on at least /etc/hostname and /usr/share/dict/words.
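
Requirements 2 and 3 could be met with loops shaped roughly like the sketch below. The names write_all and copy_fd are hypothetical, and the 64 KiB buffer size is an arbitrary choice. Argument handling and error reporting are left out.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly len bytes, looping over short writes and retrying
 * when a signal interrupts the call. Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Copy everything from in to out until read reports end of file.
 * Returns 0 on success, -1 on error. */
int copy_fd(int in, int out)
{
    char buf[65536];
    for (;;) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0)
            return 0;              /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any data arrived */
            return -1;
        }
        if (write_all(out, buf, (size_t)n) < 0)
            return -1;
    }
}
```

With no arguments, cat would call copy_fd(STDIN_FILENO, STDOUT_FILENO). With paths, it would open each one in turn and pass the resulting descriptor as in.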

Mini Application: Implement wc -l

Requirements:

  1. Use only the raw I/O syscalls.
  2. Count newlines by inspecting each byte.
  3. Accept multiple files and print a total, matching the system wc -l format.
  4. Handle input from a pipe (for example, some_command | ./wcl), which means no lseek.
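
The counting core for requirements 1, 2, and 4 might look like the following sketch. The function name count_lines and the buffer size are assumptions. Because it only ever calls read, it works the same on a regular file and on a pipe.

```c
#include <errno.h>
#include <unistd.h>

/* Count '\n' bytes on fd by reading blocks and inspecting each byte.
 * Stores the count in *out. Returns 0 on success, -1 on read error. */
int count_lines(int fd, unsigned long long *out)
{
    char buf[65536];
    unsigned long long lines = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            break;                 /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    }
    *out = lines;
    return 0;
}
```

Like wc -l, this counts newline bytes, so a final line with no trailing newline is not counted.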

Mini Application: Implement mgrep

Build a program mgrep PATTERN FILE that:

  1. Opens FILE and fstats it.
  2. mmaps it read-only.
  3. Walks the mapping, printing every line containing PATTERN (a literal, not a regex).
  4. Prints the byte offset of each match (so you can prove you are walking the mapping, not re-reading).

Compare its runtime to grep -F PATTERN FILE on a large text file.
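
Steps 3 and 4 can be sketched as a function that walks an in-memory buffer, which in mgrep would be the mapping. The names find_literal and grep_mapping and the offset:line output format are assumptions. The search is a naive memchr plus memcmp scan, chosen because a mapping is not NUL-terminated and so strstr is not safe on it.

```c
#include <stdio.h>
#include <string.h>

/* First occurrence of pat (plen bytes) in hay (hlen bytes), or NULL. */
static const char *find_literal(const char *hay, size_t hlen,
                                const char *pat, size_t plen)
{
    if (plen == 0)
        return hay;
    while (hlen >= plen) {
        /* Only positions that leave room for the whole pattern can match. */
        const char *c = memchr(hay, pat[0], hlen - plen + 1);
        if (c == NULL)
            return NULL;
        if (memcmp(c, pat, plen) == 0)
            return c;
        hlen -= (size_t)(c - hay) + 1;
        hay = c + 1;
    }
    return NULL;
}

/* Print each line of [base, base+len) that contains pat, prefixed by the
 * byte offset of the first match on that line. Returns the line count. */
size_t grep_mapping(const char *base, size_t len, const char *pat, FILE *out)
{
    size_t plen = strlen(pat);
    size_t matches = 0;
    const char *p = base;
    const char *end = base + len;

    while (p < end) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        const char *line_end = nl ? nl : end;
        const char *hit = find_literal(p, (size_t)(line_end - p), pat, plen);
        if (hit != NULL) {
            fprintf(out, "%zu:", (size_t)(hit - base));
            fwrite(p, 1, (size_t)(line_end - p), out);
            fputc('\n', out);
            matches++;
        }
        p = nl ? nl + 1 : end;
    }
    return matches;
}
```

In the full program, base and len would come from the mmap and fstat calls in steps 1 and 2.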

Scenarios

  1. A logging program opens a file with O_APPEND and forks; both processes emit each log line with a single write call. Lines from the two processes never overwrite or split each other. Why?
  2. A program mmaps a 4 GB file on a 4 GB RAM machine and works fine; the same program, using read to load the file into one big buffer, is OOM-killed. Why?
  3. A program modifies a MAP_SHARED mapping and calls exit. Another process opens the file and sees the old bytes. What was missing?
  4. A program reads a log file with read(fd, buf, 1) and is CPU-bound. strace -c shows 99% of time in read. Explain and fix.
  5. A program mmaps a file, calls ftruncate(fd, new_smaller_size), then reads through the pointer past the new size. It crashes. Why?
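
For scenario 3, the call sequence being probed can be sketched as follows. The helper name patch_byte is hypothetical, and mapping the whole file to change one byte is a simplification. The sketch shows where msync sits relative to the store and to munmap. It is not a claim about when any particular kernel makes the change visible.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: overwrite one byte of a file through a
 * MAP_SHARED mapping and ask the kernel to write it back before
 * unmapping. Returns 0 on success, -1 on error. */
int patch_byte(const char *path, off_t offset, char value)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || offset < 0 || offset >= st.st_size) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    p[offset] = value;                  /* store goes to the shared page */

    int rc = msync(p, len, MS_SYNC);    /* block until the write-back completes */
    if (munmap(p, len) < 0)
        rc = -1;
    return rc;
}
```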

Evidence Check

Complete when your cat and wc -l match the system versions on three non-trivial inputs, and your mgrep finishes faster than read-based equivalents on a 1 GB file.