Why Is My C++ Program Slow? Find Bottlenecks with Profiling (perf, VS Profiler)
Key takeaways
Measure first: seven common slowdown causes, perf/VS workflows, ten fixable patterns (copies, allocations, AoS vs SoA), and a five-step optimize loop.
Complexity basics: arrays and lists (Big-O intuition alongside profiling).
Introduction: “The code looks correct but it’s slow”
“Same complexity, but slower than Python sometimes”
When C++ feels slow, profiling turns guesses into hotspots: functions and lines that dominate time or hardware events.
This article covers:
- Seven major causes of slowdown
- Choosing a profiler
- perf basics (Linux)
- Visual Studio Profiler (Windows)
- Ten common performance patterns
- Case studies and a five-step tuning loop
Table of contents
- Seven root causes
- Profiler guide
- perf (Linux)
- Visual Studio Profiler
- Ten performance patterns
- Case studies
- Summary
1. Seven major causes (overview)
- Wrong asymptotics (e.g. nested loops vs hash set)
- Pass-by-value of large containers
- Excessive allocations inside hot loops
- Cache-unfriendly access patterns (stride, AoS vs SoA)
- Branch-heavy unpredictable control flow
- Virtual dispatch on hot inner loops
- Inefficient string building (repeated reallocations, excessive flushing)
Each has small code examples and fixes in the original article; the remedy is almost always measure → change data layout or algorithm → measure again.
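To make the first cause concrete, here is a minimal sketch (function names are illustrative, not from the original article): the same duplicate check written with nested loops and with a hash set. Only the asymptotics change; both are correct.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// O(n^2): nested loops compare every pair of elements.
bool has_duplicate_quadratic(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] == v[j]) return true;
    return false;
}

// O(n) average: a hash set remembers what has been seen so far.
bool has_duplicate_hashed(const std::vector<int>& v) {
    std::unordered_set<int> seen;
    seen.reserve(v.size());  // avoid rehashing inside the loop
    for (int x : v)
        if (!seen.insert(x).second) return true;  // insert failed: duplicate
    return false;
}
```

On a profile, the quadratic version shows up as one function dominating self time as the input grows; the hashed version stays flat far longer.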
2. Profiler guide
| Platform | Tool | Notes |
|---|---|---|
| Linux | perf | Low overhead, stack + HW counters |
| macOS | Instruments | Great UI integration |
| Windows | VS Profiler | Easy CPU sampling |
| Cross | Valgrind/callgrind | Slower, no recompile for some modes |
3. perf (Linux)
perf record -g ./myapp
perf report
perf stat -e cache-misses,cache-references ./myapp
Flame graphs: fold stacks with Brendan Gregg’s FlameGraph scripts for visual hotspots.
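As an illustration of the kind of difference `perf stat -e cache-misses` surfaces, here is a hypothetical pair of traversals over the same row-major matrix. Both compute the same sum; the column-major loop strides through memory and typically produces far more cache misses on large inputs.

```cpp
#include <cstddef>
#include <vector>

// Matrix of rows x cols ints stored row-major in one contiguous vector.

// Inner loop walks memory sequentially (stride 1): cache-friendly.
long long sum_row_major(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

// Inner loop jumps `cols` elements per step: same result, worse locality.
long long sum_col_major(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
```

Running `perf stat` on each variant with a matrix larger than the last-level cache makes the gap visible in the cache-miss counts, even though `perf report` shows the same hot function either way.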
4. Visual Studio
Debug → Performance Profiler → CPU Usage — inspect exclusive vs inclusive time and call trees.
5. Ten patterns (titles)
- Pass `const T&` instead of `T` for large inputs.
- Reuse buffers / `reserve` vectors in loops.
- Use `reserve`/`ostringstream` for string assembly.
- Prefer `unordered_map` when average O(1) beats a tree map.
- SoA for hot fields vs AoS when you touch only part of a struct.
- Reduce virtual calls in inner loops (batch by type, CRTP, etc.; design-dependent).
- Avoid `std::endl` in tight loops (forces a flush); use `'\n'`.
- Compile regexes once, not per iteration.
- Reduce lock contention with local buffers, then merge.
- Prefer a contiguous `vector<int>` over one `unique_ptr` per element when possible.
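A minimal sketch of the first two patterns (names are illustrative): take large inputs by `const` reference so no copy is made, and `reserve` before a push-back loop so the vector never reallocates mid-loop.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Pattern: const T& avoids copying the whole container on every call.
std::size_t total_length(const std::vector<std::string>& words) {
    std::size_t n = 0;
    for (const std::string& w : words)  // also by reference: no per-item copy
        n += w.size();
    return n;
}

// Pattern: reserve up front so the loop performs zero reallocations.
std::vector<int> squares(std::size_t n) {
    std::vector<int> out;
    out.reserve(n);  // one allocation instead of O(log n) growth steps
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(static_cast<int>(i * i));
    return out;
}
```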
6. Case studies (short)
- JSON-like string building: `reserve` cut reallocations → large speedups.
- N+1 queries: one JOIN vs per-row queries → orders of magnitude.
- Image filters: raw pixel pointer vs a virtual `getPixel` per pixel → far fewer calls.
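The string-building case can be sketched like this (a hypothetical JSON-like joiner, not the original code): estimate the final size, reserve once, then append in place. Without the `reserve`, each `+=` may trigger a reallocation and a full copy of everything built so far.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins items into a JSON-style array of strings: ["a","b",...]
std::string join_as_array(const std::vector<std::string>& items) {
    std::size_t est = 2;  // '[' and ']'
    for (const std::string& s : items)
        est += s.size() + 3;  // two quotes plus a separating comma
    std::string out;
    out.reserve(est);  // single allocation; appends below never reallocate
    out += '[';
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i) out += ',';
        out += '"';
        out += items[i];
        out += '"';
    }
    out += ']';
    return out;
}
```

(The sketch assumes items contain no characters that need JSON escaping; real code would escape them.)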
Five-step process
- Measure end-to-end time + profiler trace
- Identify top exclusive-time functions
- Hypothesize (allocations? copies? cache?)
- Change one thing at a time
- Re-measure; repeat until goals met
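For the re-measure step, end-to-end timing can be as simple as a small `std::chrono` wrapper (an illustrative sketch; it complements, not replaces, a profiler trace):

```cpp
#include <chrono>
#include <utility>

// Runs `fn` once and returns elapsed wall time in milliseconds.
// steady_clock is monotonic, so it is safe for measuring intervals.
template <typename Fn>
double time_ms(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

In practice, run the workload several times on fixed input and compare medians; a single run is too noisy to confirm or reject a hypothesis.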
Summary
Checklist
- Algorithm class appropriate?
- Avoid large copies?
- Hot loops allocation-free after `reserve`?
- Cache-friendly traversal?
- Locks not dominating?
Priority
- Algorithmic improvements
- Remove copies / tighten interfaces
- Allocation reduction
- Data layout / cache
- Compiler flags last—after correctness and profiling
Related posts (internal)
- Profiling deep dive
- Performance patterns
- Cache-friendly code
- Benchmarking
Keywords
slow C++, profiling, perf, bottleneck, CPU profiler, cache miss
Practical tips
- Never optimize without a profile on realistic input.
- Compare before/after with fixed seeds and hardware when possible.
- Watch self time, not only inclusive time, to pick real hotspots.
Closing
“Slow” becomes actionable when a profiler shows where time goes. Fix algorithm + data layout + allocations first; micro-optimize only on evidence.
Next: Cache-friendly coding and SIMD articles when CPU-bound.