C++ Profiling | Finding Bottlenecks with perf and gprof When You Don’t Know What’s Slow
Key takeaway
Practical C++ profiling: perf, gprof, flame graphs, chrono timers, and how to find real bottlenecks instead of guessing.
Introduction: “I don’t know what’s slow”
Real-world scenarios
What often happens:
- You spend three days optimizing a function you “thought” was slow; the real bottleneck was file I/O.
- perf report shows ??? for symbols and you can’t analyze.
- You profile with gprof but gmon.out never appears.
- Valgrind runs 30× slower and feels impractical.
- Your API server is at 100% CPU and you don’t know which handler is hot.
- After 24 hours, memory grows from 2GB to 8GB (possible leak).
- Your algorithm is O(n) but gets much slower than linear as n grows (cache effects suspected).
In these situations, measurement beats guessing. Use a profiler to find hotspots, visualize with flame graphs, optimize the top ~20% of time first—that usually gives the best return.
Optimizing from guesses wastes time
The program felt slow, so you optimized from intuition. The real bottleneck (the part that limits overall performance) was elsewhere.
Wrong approach:
// “This function must be slow” — optimize it
void processData(std::vector<int>& data) {
    // complex optimization attempts...
}

// In reality this was the bottleneck
void loadData() {
    // file I/O is slow
}
After profiling:
processData: ~5% of time
loadData: ~80% of time ← real bottleneck
Lessons:
- Don’t guess—measure
- Find bottlenecks with a profiler
- Optimize the slowest parts first
Profiling means measuring which functions use how much CPU or memory at runtime. Without it, “this part feels slow” often points at the wrong layer—I/O or another module may dominate. Use CPU sampling (e.g. perf) or instrumentation first to see where time goes, then optimize the top few percent.
End-to-end profiling flow
flowchart TD
A[Program is slow] --> B[Guess without measuring]
B --> C{Hit the bottleneck?}
C -->|No| D[Wasted time]
A --> E[Run profiling]
E --> F[Find hotspots]
F --> G[Optimize top ~20%]
G --> H[Re-measure]
H --> I{Goal met?}
I -->|No| E
I -->|Yes| J[Done]
After reading this article you will:
- Use profiling tools effectively
- Pinpoint bottlenecks accurately
- Measure performance quantitatively
- Optimize effectively in practice
Table of contents
- What is profiling
- Basic timing
- Profiling tools
- Complete profiling example
- Bottleneck analysis
- Practical optimization process
- Common problems
- Checklist
1. What is profiling
Why measure performance
“Don’t guess—measure.”
- Intuition is often wrong
- Bottlenecks hide in unexpected places
- Optimization without measurement wastes time
Kinds of profiling
1. CPU profiling
- Which functions use the most CPU
- Call counts and time spent
2. Memory profiling
- Memory usage
- Allocation/deallocation counts
- Leaks
3. Cache profiling
- Cache miss counts
- Access patterns
Profiling categories at a glance
flowchart LR
subgraph CPU["CPU profiling"]
C1[perf]
C2[gprof]
C3[VS Profiler]
end
subgraph MEM["Memory profiling"]
M1[Valgrind Memcheck]
M2[AddressSanitizer]
end
subgraph CACHE["Cache profiling"]
K1[Valgrind Cachegrind]
K2[perf stat]
end
2. Basic timing
Measuring with std::chrono
Since C++11, std::chrono can measure intervals. Take high_resolution_clock::now() at start and end, subtract to get a duration, then duration_cast to milliseconds or microseconds. That turns “feels slow” into numbers.
// After pasting: g++ -std=c++17 -o profile_time profile_time.cpp && ./profile_time
#include <chrono>
#include <iostream>

void slowFunction() {
    // heavy work...
}

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    slowFunction();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Time: " << duration.count() << " ms\n";
    return 0;
}
Sample output: Time: N ms (N depends on the environment).
Details:
- high_resolution_clock: the finest clock available
- now(): current time as a time_point
- duration_cast: convert to milliseconds, microseconds, etc.
- count(): integer value in that unit
RAII timer helper
Record time in the constructor and print elapsed time in the destructor—classic RAII timer. { Timer t("slowFunction"); slowFunction(); } prints when the scope ends. Exceptions and early returns still run the destructor, so you miss fewer “end times” than manual prints.
class Timer {
    std::chrono::high_resolution_clock::time_point start;
    const char* name;
public:
    Timer(const char* n) : name(n) {
        start = std::chrono::high_resolution_clock::now();
    }
    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        std::cout << name << ": " << duration.count() << " us\n";
    }
};

void processData() {
    Timer timer("processData");
    // work...
} // prints automatically in destructor
Note: Keep the Timer in the right scope—use { } blocks so the measured region is clear.
Multiple sections
class Profiler {
    std::map<std::string, long long> timings;
    std::chrono::high_resolution_clock::time_point start;
public:
    void startTimer() {
        start = std::chrono::high_resolution_clock::now();
    }
    void record(const std::string& name) {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        timings[name] += duration.count();
        start = end;
    }
    void report() {
        for (const auto& [name, time] : timings) {
            std::cout << name << ": " << time << " us\n";
        }
    }
};

int main() {
    Profiler prof;
    prof.startTimer();
    loadData();
    prof.record("loadData");
    processData();
    prof.record("processData");
    saveData();
    prof.record("saveData");
    prof.report();
}
Usage: Each record() adds the time since the previous record() (or startTimer()). start = end advances to the next segment; repeat to accumulate totals.
3. Profiling tools
Choosing a tool
flowchart TD
A[Need profiling] --> B{Platform?}
B -->|Linux| C[perf]
B -->|Linux/Mac| D[gprof]
B -->|Linux/Mac| E[Valgrind]
B -->|Windows| F[VS Profiler]
C --> G[CPU sampling]
D --> H[Instrumentation]
E --> I[Memory/cache]
F --> G
perf (Linux)
The standard Linux profiler. Sampling records which function is on-CPU periodically—low overhead, usable even in production-like settings.
# Profile while running
perf record ./myapp
# View results
perf report
# Per-function stats
perf stat ./myapp
Example output:
50.23% myapp [.] processData
30.45% myapp [.] loadFile
15.32% myapp [.] parseJson
perf report tips:
# Include call graph
perf record -g ./myapp
# Text report
perf report --stdio
# Filter symbol
perf report --symbol-filter=processData
Interpreting perf stat:
Performance counter stats for './myapp':
1,234.56 msec task-clock
42 context-switches
0 cpu-migrations
128 page-faults
3,456,789,012 cycles
2,345,678,901 instructions
- task-clock: CPU time (ms)
- context-switches: context switch count
- page-faults: page fault count
- cycles, instructions: hardware counters
IPC (instructions per cycle): instructions / cycles near 1 suggests efficient CPU use; well below 0.5 may indicate memory stalls or bad branch prediction.
gprof (GNU profiler)
Compile with -pg to inject profiling code. Running produces gmon.out; gprof reports per-function time and call counts.
g++ -pg -O2 main.cpp -o myapp
./myapp
gprof myapp gmon.out
Sample gprof output:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
80.0 0.80 0.80 1 800.00 800.00 loadFile
15.0 0.95 0.15 100 1.50 1.50 processData
5.0 1.00 0.05 1 50.00 50.00 saveResult
Note: -pg with -O2 can inline and merge functions—use -O0/-O1 if you need clearer call relationships.
Valgrind Callgrind
Simulates execution step by step—accurate call graphs and cache info, but 10–50× slower—use on short runs only.
valgrind --tool=callgrind ./myapp
callgrind_annotate callgrind.out.12345
# GUI: kcachegrind
Options:
valgrind --tool=callgrind --cache-sim=yes ./myapp
valgrind --tool=callgrind --toggle-collect=processData ./myapp
Visual Studio Profiler
1. Debug → Performance Profiler
2. CPU Usage
3. Start, run app
4. Inspect Hot Path and per-function time
Tool comparison
| Tool | Platform | Method | Overhead | Production |
|---|---|---|---|---|
| perf | Linux | Sampling | Low (~5%) | Yes |
| gprof | Linux/Mac | Instrumentation | Medium (~10%) | Sometimes |
| Valgrind | Linux/Mac | Simulation | Very high (10–50×) | No |
| VS Profiler | Windows | Sampling | Low | Yes |
Flame graphs
Flame graphs stack frames bottom-up; width shows share of CPU time—great for spotting hot paths.
perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
How to read:
- Width: fraction of CPU time on that path
- Height: call stack (caller below, callee above)
- Wide bars: hottest paths
Full flame graph workflow:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$(pwd)/FlameGraph"
perf record -F 99 -g --call-graph dwarf,8192 ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
open flamegraph.svg # macOS
# xdg-open flamegraph.svg # Linux
Common patterns:
| Pattern | Meaning | Action |
|---|---|---|
| Wide memcpy | Copy-bound | Buffer pools, zero-copy |
| Wide malloc/free | Allocation cost | Pools, arenas |
| Wide std::sort | Sort cost | Avoid sort, partial sort |
| Wide pthread_mutex_lock | Lock wait | Smaller critical sections, lock-free where safe |
4. Complete profiling example
Target program
// profile_target.cpp — analyze with perf, gprof
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iostream>
void processDataCacheUnfriendly(std::vector<int>& data) {
    const size_t stride = 16;
    for (size_t i = 0; i < data.size(); i += stride) {
        data[i] = data[i] * 2 + 1;
    }
}

void processDataCacheFriendly(std::vector<int>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2 + 1;
    }
}

void sortData(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}

void fillRandom(std::vector<int>& data) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000000);
    for (auto& v : data) {
        v = dis(gen);
    }
}

int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);
    fillRandom(data);
    sortData(data);
    processDataCacheUnfriendly(data);
    processDataCacheFriendly(data);
    return 0;
}
perf example
g++ -std=c++17 -O2 -g -o profile_target profile_target.cpp
perf record -F 99 -g --call-graph dwarf,8192 ./profile_target
perf report --stdio
perf stat -e cycles,instructions,cache-references,cache-misses ./profile_target
Sample perf report --stdio:
# 45.23% profile_target [.] sortData
# 28.10% profile_target [.] fillRandom
# 12.30% profile_target [.] processDataCacheUnfriendly
# 10.00% profile_target [.] processDataCacheFriendly
Hotspot: sortData ~45% → consider algorithm changes or removing sort.
gprof example
g++ -std=c++17 -O2 -pg -g -o profile_target_gprof profile_target.cpp
./profile_target_gprof
gprof -p profile_target_gprof gmon.out
gprof -q profile_target_gprof gmon.out
gprof profile_target_gprof gmon.out > gprof_report.txt
Reading gprof: focus on % time, self seconds, calls.
Hotspot workflow
flowchart TD
A[Run program] --> B[perf record -g]
B --> C[perf report]
C --> D{Top 3 functions?}
D --> E[Widest bar = bottleneck]
E --> F[Refine with Timer]
F --> G[Pick optimization target]
G --> H[Re-measure after fix]
5. Bottleneck analysis
Finding hotspots
// Profiling says:
// 80% - loadFile() ← bottleneck!
// 15% - processData()
// 5% - saveResult()
void loadFile(const std::string& path) {
    Timer timer("loadFile");
    std::ifstream file;
    { Timer t("open");  file.open(path); }
    { Timer t("read");  /* read... slow here */ }
    { Timer t("parse"); /* parse... */ }
}
Call counts (simple instrumentation)
class CallCounter {
    static std::map<std::string, int> counts;
    std::string name;
public:
    CallCounter(const char* n) : name(n) {
        counts[name]++;
    }
    static void report() {
        for (const auto& [name, count] : counts) {
            std::cout << name << ": " << count << " calls\n";
        }
    }
};
std::map<std::string, int> CallCounter::counts;
Pareto (80/20)
~80% of runtime often comes from the top ~20% of functions.
Optimizing those first yields most of the win.
6. Practical optimization process
- Measure baseline (chrono, benchmarks)
- Profile (perf record -g, etc.)
- Optimize the real hotspot (e.g. reserve for vectors)
- Re-measure
- Repeat
Benchmarking tips
- Warm up caches before timing
- Run multiple iterations and average or take median
- Use -O2/-O3 for release-like numbers when that matches production
Memory profiling
valgrind --leak-check=full ./myapp
AddressSanitizer (faster than Valgrind for many bugs):
g++ -g -O1 -fsanitize=address -fno-omit-frame-pointer main.cpp -o myapp
./myapp
7. Common problems
perf permission denied
Lower kernel.perf_event_paranoid or run with appropriate privileges (see your distro docs).
No gmon.out
Compile and link with -pg, and make sure the program exits normally: gmon.out is written by an exit handler, so a process killed by a signal (e.g. Ctrl+C or abort) may never produce it.
Valgrind too slow
Use smaller inputs, or use perf for CPU-only work.
Symbols show as ???
Build with -g, avoid stripping debug info.
Inlined functions disappear from profile
Try -O1/-O0 for profiling builds, or mark critical functions __attribute__((noinline)).
8. Checklist
Before profiling
- -g for symbols
- Choose optimization level (-O1 often balances accuracy vs reality)
- perf: check perf_event_paranoid
- gprof: -pg
- Valgrind: shrink workload
After profiling
- Identify top ~20% functions
- Drill down with timers
- Record baseline before changes
- Re-measure after changes
- Regression-test behavior
Principles
- Measure, don’t guess
- Fix big bottlenecks first
- Compare before/after
- Use profilers systematically
Related posts (internal)
- Cache-friendly C++
- Compile-time optimization
- Compiler optimization PGO/LTO
Keywords (search)
C++ profiling, perf, gprof, Valgrind, bottleneck, performance measurement, optimization, flame graph
Summary
| Tool | Platform | Role |
|---|---|---|
| perf | Linux | CPU sampling |
| gprof | Linux/Mac | Per-function time |
| Valgrind | Linux/Mac | Memory, cache |
| VS Profiler | Windows | CPU, memory |
| std::chrono | All | Manual timing |
Principles: measure first; optimize hotspots; compare before/after; use profilers.
Practical tips
Debugging
- Fix compiler warnings first
- Reproduce with a small test case
Performance
- Don’t optimize without profiling
- Define measurable goals
Code review
- Check common review feedback early
- Follow team conventions
FAQ
When is this useful in practice?
A. Finding bottlenecks with perf/gprof/Valgrind, measuring performance, and choosing what to optimize—use the article’s workflows and examples.
perf vs gprof?
A. On Linux, prefer perf (sampling, low overhead). gprof needs -pg and rebuilds. For exact call graphs on short runs, consider Callgrind.
Does profiling slow the app?
A. perf sampling is usually ~5% overhead. gprof instrumentation is higher. Valgrind is 10–50×—short runs only.
What to read first?
A. Follow previous-post links or the C++ series index.
Go deeper?
A. See cppreference and official tool documentation.
One-line summary: Use chrono and profilers to find real hotspots, then optimize. Next: cache-friendly code (#15-2).
Next: C++ practical guide #15-2: cache-friendly code
Previous: Perfect forwarding (#14-2)
Related posts
- Cache optimization
- Compile-time optimization
- Slow program causes
- Advanced profiling
- STL algorithms basics