Profiling and Tracing: `perf`, `strace`, `ltrace`, `valgrind --tool=callgrind`

What This Concept Is

Four tools, four different questions about a running program.

strace -- "which system calls is this program making?" Hooks into the kernel boundary via ptrace. Prints every syscall with its arguments and return value; add -t or -tt for timestamps and -T for the time spent inside each call. Ideal for "why is my program stuck?" (usually: waiting in read or futex).
ltrace -- "which library calls is this program making?" Like strace but for dynamic-library calls, e.g., malloc, strcpy. Works by intercepting PLT entries.
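perf -- "where is this program spending CPU time?" Sampling profiler built into the Linux kernel and driven by the CPU's hardware counters. Low overhead, because it samples rather than instruments. perf record + perf report gives a flat breakdown by function; recording with -g also captures call stacks, which adds the tree-structured view.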
valgrind --tool=callgrind -- "give me a deterministic instruction-level call graph." Instrumentation profiler: the program runs inside a software CPU simulator, so callgrind counts every instruction per function. Slow (10×-50×) but extremely detailed; KCachegrind visualizes the result.

Also under the valgrind umbrella: valgrind with the default tool (memcheck) is the go-to for "why is this program leaking / using uninitialized memory / reading freed memory?"

Why It Matters Here

"My program is slow" is a symptom, not a diagnosis. These tools tell you what kind of slow:

Waiting on the kernel -> strace.
Calling the same library function too many times -> ltrace / perf.
Burning CPU in one hot function -> perf.
Memory leaks or corruption -> valgrind (memcheck).
Need an exact call graph for a paper or a review -> callgrind + KCachegrind.

"My program is wrong" is equally a symptom rather than a diagnosis. strace routinely reveals that a program is reading from the wrong file, connecting to the wrong port, or failing an open and ignoring the error.

Concrete Example

Suppose you have a program that copies a file slowly:

$ strace -c ./slow_copy big.bin out.bin
% time     seconds  usecs/call     calls    errors syscall
 98.00     0.500000          10     50000           read
  1.00     0.010000          10      1000           write
  ...

98% of the syscall time in read, with 50,000 read calls against 1,000 write calls, points to a buffer-size problem: the program is issuing many tiny reads (in the worst case, 1 byte at a time). Fix it to read 4 KB at a time and re-measure.
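The source of slow_copy is not shown, so the following is only a minimal sketch of the kind of loop that produces such a histogram. The helper name copy_fd, the DEMO_MAIN guard, and the command-line layout are all hypothetical. Passing a buffer size of 1 reproduces the slow case; 4096 is the fix.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from in_fd to out_fd through a buffer
 * of buf_size bytes. Returns the number of read() calls issued (including the
 * final one that reports end of file), or -1 on error. */
long copy_fd(int in_fd, int out_fd, size_t buf_size)
{
    if (buf_size == 0)
        return -1;
    char *buf = malloc(buf_size);
    if (buf == NULL)
        return -1;

    long reads = 0;
    for (;;) {
        ssize_t n = read(in_fd, buf, buf_size);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry the read */
            free(buf);
            return -1;
        }
        reads++;
        if (n == 0)
            break;                      /* end of file */

        /* write() may accept fewer bytes than requested, so loop. */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                free(buf);
                return -1;
            }
            off += w;
        }
    }
    free(buf);
    return reads;
}

#ifdef DEMO_MAIN
/* Hypothetical driver: ./copy SRC DST BUFSIZE */
int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s SRC DST BUFSIZE\n", argv[0]);
        return 2;
    }
    int in_fd = open(argv[1], O_RDONLY);
    if (in_fd < 0) {
        perror(argv[1]);
        return 1;
    }
    int out_fd = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out_fd < 0) {
        perror(argv[2]);
        close(in_fd);
        return 1;
    }
    long reads = copy_fd(in_fd, out_fd, (size_t)strtoul(argv[3], NULL, 10));
    close(in_fd);
    close(out_fd);
    if (reads < 0) {
        fprintf(stderr, "copy failed\n");
        return 1;
    }
    fprintf(stderr, "read() calls: %ld\n", reads);
    return 0;
}
#endif
```

Built with -DDEMO_MAIN and run under strace -c once with BUFSIZE 1 and once with 4096, the read count should drop from about one call per byte to about one call per 4 KB. Most of the cost is the number of kernel crossings, not the bytes moved.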

Another scenario: "my program spikes to 100% CPU sometimes":

$ perf record -g ./myprog
$ perf report
   +  45.2%  myprog  libc.so.6   [.] __memcpy_avx2
   +  30.1%  myprog  myprog      [.] hash_lookup

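45% in memcpy is a hint that you are copying data that could be passed by pointer. The flat view alone does not say who calls memcpy: expand the entry (the + marker) to see the call chain. If it leads back to hash_lookup, the copies are in your hash path, typically strings being duplicated on every lookup.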
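A sketch of what such a fix might look like. This is not the hash_lookup from the profile above (its source is not shown); the table layout, SLOTS, KEY_MAX, and every function name here are hypothetical. The two lookup functions return the same answers, but the first duplicates the key on every call and the second just passes the pointer through.

```c
#include <stddef.h>
#include <string.h>

#define KEY_MAX 64   /* hypothetical maximum key length, including '\0' */
#define SLOTS   16   /* hypothetical table size; must be a power of two */

struct slot {
    const char *key;   /* NULL means the slot is empty */
    int value;
};

/* djb2-style string hash, reduced to a slot index. */
static size_t hash_str(const char *s)
{
    size_t h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h & (SLOTS - 1);
}

/* Insert with linear probing. The table stores the caller's pointer, not a
 * copy of the key. Returns 0 on success, -1 if the table is full. */
int table_put(struct slot *t, const char *key, int value)
{
    size_t i = hash_str(key);
    for (size_t n = 0; n < SLOTS; n++, i = (i + 1) & (SLOTS - 1)) {
        if (t[i].key == NULL || strcmp(t[i].key, key) == 0) {
            t[i].key = key;
            t[i].value = value;
            return 0;
        }
    }
    return -1;
}

/* Shared probe loop: returns a pointer to the value, or NULL if absent. */
static const int *probe(const struct slot *t, const char *key)
{
    size_t i = hash_str(key);
    for (size_t n = 0; n < SLOTS; n++, i = (i + 1) & (SLOTS - 1)) {
        if (t[i].key == NULL)
            return NULL;
        if (strcmp(t[i].key, key) == 0)
            return &t[i].value;
    }
    return NULL;
}

/* Copy-heavy variant: duplicates the key into a local buffer on every call,
 * which is the kind of code that shows up as memcpy in a profile. */
const int *hash_lookup_copying(const struct slot *t, const char *key)
{
    char local[KEY_MAX];
    size_t len = strlen(key);
    if (len >= KEY_MAX)
        return NULL;                 /* key too long for the local buffer */
    memcpy(local, key, len + 1);     /* copy including the terminator */
    return probe(t, local);
}

/* Pointer-passing variant: same result, no copy. */
const int *hash_lookup(const struct slot *t, const char *key)
{
    return probe(t, key);
}
```

After a change like this, re-run perf record and confirm that the memcpy share actually drops; an optimizer may already have removed small copies, so measure rather than assume.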

Detecting a leak:

$ valgrind --leak-check=full ./prog
==12345== 4,096 bytes in 1 blocks are definitely lost in loss record 1 of 1
==12345==    at 0x...: malloc
==12345==    by 0x...: make_buffer (prog.c:12)
==12345==    by 0x...: main (prog.c:25)

The block allocated by the malloc in make_buffer (prog.c:12) is never freed. Fix it by adding the missing free (or by using an arena, Concept 9).
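The source of prog.c is not shown; the following is a hypothetical reconstruction of the pattern, with invented function names (run_leaky, run_fixed) and an invented DEMO_MAIN guard. The only difference between the two callers is the free.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for make_buffer: allocates a zeroed buffer and hands
 * ownership to the caller, who is responsible for calling free(). */
char *make_buffer(size_t size)
{
    char *buf = malloc(size);
    if (buf != NULL)
        memset(buf, 0, size);
    return buf;
}

/* Leaky caller: uses the buffer and returns without freeing it. */
int run_leaky(void)
{
    char *buf = make_buffer(4096);
    if (buf == NULL)
        return -1;
    buf[0] = 'x';
    return buf[0];          /* the only pointer to the block is lost here */
}

/* Fixed caller: saves the result, then releases the buffer. */
int run_fixed(void)
{
    char *buf = make_buffer(4096);
    if (buf == NULL)
        return -1;
    buf[0] = 'x';
    int result = buf[0];
    free(buf);              /* the missing free */
    return result;
}

#ifdef DEMO_MAIN
/* Hypothetical driver: define LEAK as well to exercise the leaky path. */
int main(void)
{
#ifdef LEAK
    return run_leaky() == 'x' ? 0 : 1;
#else
    return run_fixed() == 'x' ? 0 : 1;
#endif
}
#endif
```

For this demonstration, build with -g -O0: an optimizing compiler is allowed to remove an allocation whose result is never really used, in which case memcheck would have nothing to report.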

Common Confusion / Misconception

"strace shows me what my program is doing." It shows what the kernel is doing for it. Between two strace lines, minutes of CPU work can elapse in user mode. Pair it with perf when you care about user-mode time.

"valgrind is a profiler." The default tool (memcheck) is not a profiler; it is a memory-error detector. valgrind --tool=callgrind is the profiler. Mixing them up leads to confusing output.

"perf is only for Linux servers." perf is Linux-specific, but it is available in WSL2. macOS has Instruments; FreeBSD has dtrace. The mental model transfers.

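Another trap: measuring debug builds. -O0 -g is great for gdb but gives misleading profile numbers: without inlining and the other optimizations, time lands in functions that would be cheap or absent in an optimized build. Profile with -O2 -g (keep debug info) so the profile reflects production behavior. If perf record -g then shows truncated call chains, also build with -fno-omit-frame-pointer or record with --call-graph dwarf.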
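To make the first point in this section concrete (strace sees only the kernel boundary), here is a minimal sketch of a program that burns CPU without making any system calls in its hot loop. The function names and the DEMO_MAIN guard are hypothetical; getrusage is the standard POSIX call for reading a process's user and system CPU time.

```c
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>

static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Pure user-mode work: the loop makes no system calls, so strace prints
 * nothing while it runs. volatile keeps the compiler from deleting it. */
unsigned long busy_sum(unsigned long iterations)
{
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < iterations; i++)
        acc += i;
    return acc;
}

/* Report the CPU seconds this process has spent so far in user mode and in
 * the kernel. Returns 0 on success, -1 on failure. */
int cpu_times(double *user_s, double *sys_s)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *user_s = tv_seconds(ru.ru_utime);
    *sys_s = tv_seconds(ru.ru_stime);
    return 0;
}

#ifdef DEMO_MAIN
int main(void)
{
    double u0, s0, u1, s1;
    if (cpu_times(&u0, &s0) != 0)
        return 1;
    busy_sum(500000000UL);
    if (cpu_times(&u1, &s1) != 0)
        return 1;
    printf("user: %.3f s, system: %.3f s\n", u1 - u0, s1 - s0);
    return 0;
}
#endif
```

Under strace, the trace should go quiet for the whole duration of busy_sum, while nearly all of the reported time is user time. That silent gap is what perf is for.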

How To Use It

A triage tree:

Is the program stuck (not consuming CPU but not finishing)? strace -p <pid> -- which syscall is it in?
Is the program spinning (100% CPU but slow)? perf top -p <pid> -- which function?
Is the program slower than expected? perf record on a full run, perf report. Look at the top 3 symbols.
Is the program wrong / crashy / leaking? valgrind ./prog first; then gdb ./prog core on the dump.
Do you need deterministic profile data (e.g., to compare two implementations line-for-line)? valgrind --tool=callgrind, view with KCachegrind.

Check Yourself

In one sentence, what question does each of strace, ltrace, perf, and callgrind answer?
Why is perf record better than naive printf timing for finding a hot function?
Why is profiling a -O0 build misleading?

Mini Drill or Application

Do all four:

Write a 1-byte-at-a-time copy program. Run it under strace -c on a 10 MB file and record the syscall histogram. Fix the buffer to 4 KB and re-measure.
Write a program that does 1 million malloc/free of a small struct. Run under perf record -g and identify the top functions.
Write a program with a deliberate leak (allocate 1 MB and forget to free). Run valgrind --leak-check=full. Paste the offending backtrace.
Explain, in one sentence, how you would decide between strace and perf as your first tool.

Read This Only If Stuck

COD 1.4: Performance -- the conceptual backbone for "measured, not guessed"
Man page: man 1 strace
Man page: man 1 perf
Brendan Gregg: Linux perf Examples
Valgrind User Manual
Callgrind Manual
