Profiling and Tracing: perf, strace, ltrace, valgrind --tool=callgrind
What This Concept Is
Four tools, four different questions about a running program.
- strace -- "which system calls is this program making?" Hooks into the kernel boundary via ptrace. Prints every syscall with its arguments and return value (add -t or -r for timestamps). Ideal for "why is my program stuck?" (usually: waiting in read or futex).
- ltrace -- "which library calls is this program making?" Like strace but for dynamic-library calls, e.g., malloc, strcpy. Works by intercepting PLT entries.
- perf -- "where is this program spending CPU time?" Hardware-counter-based sampling profiler built into the Linux kernel. Very low overhead, because it samples rather than instruments. perf record + perf report gives a flat and tree-structured breakdown by function.
- valgrind --tool=callgrind -- "give me a deterministic instruction-level call graph." Instrumentation profiler: the program runs inside a software CPU simulator, so callgrind counts every instruction per function. Slow (10×-50×) but extremely detailed; KCachegrind visualizes the result.
Also under the valgrind umbrella: valgrind with the default tool (memcheck) is the go-to for "why is this program leaking / using uninitialized memory / reading freed memory?"
Why It Matters Here
"My program is slow" is a triage, not a diagnosis. These tools tell you what kind of slow:
- Waiting on the kernel -> strace.
- Calling the same library function too many times -> ltrace / perf.
- Burning CPU in one hot function -> perf.
- Memory leaks or corruption -> valgrind (memcheck).
- Need an exact call graph for a paper or a review -> callgrind + KCachegrind.
"My program is wrong" is equally a triage. strace routinely reveals that a program is reading from the wrong file, connecting to the wrong port, or failing an open and ignoring the error.
Concrete Example
Suppose you have a program that copies a file slowly:
$ strace -c ./slow_copy big.bin out.bin
% time     seconds  usecs/call     calls    errors syscall
 98.00    0.500000          10     50000           read
  1.00    0.010000          10      1000           write
...
98% of the measured syscall time in read, with 50,000 read calls against 1,000 write calls, points to a buffer-size problem: the program is issuing many tiny reads, in the worst case 1 byte at a time. Change it to read 4 KB at a time and re-measure.
Another scenario: "my program spikes to 100% CPU sometimes":
$ perf record -g ./myprog
$ perf report
+ 45.2% myprog libc.so.6 [.] __memcpy_avx2
+ 30.1% myprog myprog [.] hash_lookup
45% in memcpy is worth chasing: expand the + entry to see who calls it. If the callers sit in your hash path, that is a hint you are copying strings that could be passed by pointer.
Detecting a leak:
$ valgrind --leak-check=full ./prog
==12345== 4,096 bytes in 1 blocks are definitely lost in loss record 1 of 1
==12345== at 0x...: malloc
==12345== by 0x...: make_buffer (prog.c:12)
==12345== by 0x...: main (prog.c:25)
The block allocated by the malloc at prog.c:12 is never freed. Fix it by adding the missing free (or by using an arena, Concept 9).
Common Confusion / Misconception
"strace shows me what my program is doing." It shows what the kernel is doing for it. Between two strace lines, minutes of CPU work can elapse in user mode. Pair it with perf when you care about user-mode time.
"valgrind is a profiler." The default tool (memcheck) is not a profiler; it is a memory-error detector. valgrind --tool=callgrind is the profiler. Mixing them up leads to confusing output.
"perf is only for Linux servers." perf is Linux-specific, but it is available in WSL2. macOS has Instruments; FreeBSD has dtrace. The mental model transfers.
Another trap: measuring debug builds. -O0 -g is great for gdb but gives misleading profile numbers because the optimizer has not run. Profile with -O2 -g (keep debug info) so the profile reflects production behavior.
How To Use It
A triage tree:
- Is the program stuck (not consuming CPU but not finishing)? strace -p <pid> -- which syscall is it in?
- Is the program spinning (100% CPU but slow)? perf top -p <pid> -- which function?
- Is the program slower than expected? perf record on a full run, perf report. Look at the top 3 symbols.
- Is the program wrong / crashy / leaking? valgrind ./prog first; then gdb ./prog core on the dump.
- Do you need deterministic profile data (e.g., to compare two implementations line-for-line)? valgrind --tool=callgrind, view with KCachegrind.
Check Yourself
- In one sentence, what question does each of strace, ltrace, perf, and callgrind answer?
- Why is perf record better than naive printf timing for finding a hot function?
- Why is profiling a -O0 build misleading?
Mini Drill or Application
Do all four:
- Write a 1-byte-at-a-time copy program. Run it under strace -c on a 10 MB file and record the syscall histogram. Fix the buffer to 4 KB and re-measure.
- Write a program that does 1 million malloc/free of a small struct. Run under perf record -g and identify the top functions.
- Write a program with a deliberate leak (allocate 1 MB and forget to free). Run valgrind --leak-check=full. Paste the offending backtrace.
- Explain, in one sentence, how you would decide between strace and perf as your first tool.
Read This Only If Stuck
- COD 1.4: Performance -- the conceptual backbone for "measured, not guessed"
- Man page: man 1 strace
- Man page: man 1 perf
- Brendan Gregg: Linux perf Examples
- Valgrind User Manual
- Callgrind Manual