Why Is My C++ Program Slow? Find Bottlenecks with Profiling
Key takeaways
Beyond Big-O, C++ programs slow down through copying, allocations, cache misses, branch mispredictions, and virtual calls. Use perf and the Visual Studio Profiler to find hotspots, read flame graphs, and apply fix patterns.
Complexity basics: [arrays and lists](/en/blog/algorithm-series-01-array-list/) (Big-O intuition alongside profiling).
Introduction: “The code looks correct but it’s slow”
“Same complexity, but slower than Python sometimes”
When C++ feels slow, profiling turns guesses into hotspots: functions and lines that dominate time or hardware events. This article covers:
- Seven major causes of slowdown
- Choosing a profiler
- perf basics (Linux)
- Visual Studio Profiler (Windows)
- Ten common performance patterns
- Case studies and a five-step tuning loop
Table of contents
- Seven root causes
- Profiler guide
- perf (Linux)
- Visual Studio Profiler
- Ten performance patterns
- Case studies
- Summary
1. Seven major causes (overview)
- Wrong asymptotics (e.g. nested loops vs hash set)
- Pass-by-value of large containers
- Excessive allocations inside hot loops
- Cache-unfriendly access patterns (stride, AoS vs SoA)
- Branch-heavy unpredictable control flow
- Virtual dispatch on hot inner loops
- Inefficient string building (repeated reallocations, excessive flushing)
Each has small code examples and fixes in the original article; the remedy is almost always measure → change data layout or algorithm → measure again.
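To make the first cause concrete, here is a minimal sketch of the "nested loops vs hash set" case, using duplicate detection as a stand-in problem. The function names are hypothetical and chosen only for illustration; both functions return the same answer, but the first does quadratic work while the second is linear on average.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// O(n^2): compares every pair of elements.
bool has_duplicate_quadratic(const std::vector<int>& values) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        for (std::size_t j = i + 1; j < values.size(); ++j) {
            if (values[i] == values[j]) return true;
        }
    }
    return false;
}

// Average O(n): a single pass that remembers seen values in a hash set.
bool has_duplicate_hashed(const std::vector<int>& values) {
    std::unordered_set<int> seen;
    seen.reserve(values.size());  // avoid rehashing while inserting
    for (int v : values) {
        // insert() returns {iterator, bool}; false means v was already present.
        if (!seen.insert(v).second) return true;
    }
    return false;
}
```

For small inputs the quadratic version can still win because it allocates nothing, which is one more reason to measure on realistic input sizes rather than assume.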
2. Profiler guide
| Platform | Tool | Notes |
|---|---|---|
| Linux | perf | Low overhead, stack + HW counters |
| macOS | Instruments | Great UI integration |
| Windows | VS Profiler | Easy CPU sampling |
| Cross | Valgrind/callgrind | Slower, no recompile for some modes |
3. perf (Linux)
perf record -g ./myapp
perf report
perf stat -e cache-misses,cache-references ./myapp
Flame graphs: fold stacks with Brendan Gregg’s FlameGraph scripts for visual hotspots.
4. Visual Studio
Debug → Performance Profiler → CPU Usage — inspect exclusive vs inclusive time and call trees.
5. Ten patterns (titles)
- Pass const T& instead of T for large inputs.
- Reuse buffers / reserve vectors in loops.
- Use reserve / ostringstream for string assembly.
- Prefer unordered_map when average O(1) beats tree map.
- SoA for hot fields vs AoS when you touch only part of a struct.
- Reduce virtual calls in inner loops (batch by type, CRTP, etc.—design-dependent).
- Avoid std::endl in tight loops (forces flush); use '\n'.
- Compile regexes once, not per iteration.
- Reduce lock contention with local buffers then merge.
- Prefer contiguous vector<int> over unique_ptr per element when possible.
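Two of the patterns above, passing large inputs by const reference and reserving before string assembly, combine naturally. The sketch below assumes a hypothetical helper named join_with_commas; it is one possible shape, not the only way to apply the patterns.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Takes the parts by const reference, so the caller's vector is not copied.
std::string join_with_commas(const std::vector<std::string>& parts) {
    // First pass: compute the final length so the result allocates once.
    std::size_t total = parts.empty() ? 0 : parts.size() - 1;  // separators
    for (const std::string& p : parts) total += p.size();

    std::string out;
    out.reserve(total);
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += ',';
        out += parts[i];
    }
    assert(out.size() == total);  // the reserve estimate was exact
    return out;
}
```

Without the reserve call the string would typically grow geometrically and reallocate several times; whether that matters in a given program is something the profiler should confirm.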
6. Case studies (short)
- JSON-like string building: reserve cut reallocations → large speedups.
- N+1 queries: one JOIN vs per-row queries → orders of magnitude.
- Image filters: raw pixel pointer vs virtual getPixel per pixel → fewer calls.
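The image-filter case can be sketched as follows. The Image and GrayImage types are assumptions made up for this illustration (the case study only mentions a virtual getPixel); the point is the difference between one virtual call per pixel and a plain loop over contiguous memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical interface: every pixel read goes through a virtual call.
struct Image {
    virtual ~Image() = default;
    virtual std::size_t size() const = 0;
    virtual std::uint8_t getPixel(std::size_t i) const = 0;
};

struct GrayImage final : Image {
    std::vector<std::uint8_t> pixels;
    explicit GrayImage(std::vector<std::uint8_t> p) : pixels(std::move(p)) {}
    std::size_t size() const override { return pixels.size(); }
    std::uint8_t getPixel(std::size_t i) const override { return pixels[i]; }
};

// Slow shape: virtual dispatch (size() and getPixel()) inside the inner loop.
std::uint64_t sum_via_virtual(const Image& img) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < img.size(); ++i) sum += img.getPixel(i);
    return sum;
}

// Faster shape: contiguous buffer, no per-pixel call, easy to vectorize.
std::uint64_t sum_via_pointer(const std::uint8_t* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += data[i];
    return sum;
}
```

A caller that knows the concrete type can pass pixels.data() and pixels.size() to the pointer version and keep the virtual interface for code outside the hot loop.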
Five-step process
- Measure end-to-end time + profiler trace
- Identify top exclusive-time functions
- Hypothesize (allocations? copies? cache?)
- Change one thing at a time
- Re-measure; repeat until goals met
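The measure and re-measure steps need a repeatable end-to-end timer alongside the profiler. Below is a minimal sketch using std::chrono::steady_clock; the helper name best_of_ns and the choice to report the fastest of several runs are assumptions for illustration, not a prescribed methodology.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Runs `work` `repeats` times and returns the fastest wall-clock run in
// nanoseconds. Taking the minimum filters out some scheduling noise.
template <typename F>
std::int64_t best_of_ns(F&& work, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;  // monotonic, unlike system_clock
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count();
        if (ns < best) best = ns;
    }
    return best;
}
```

Record the number before a change and again after it, on the same machine and input, so that each "change one thing" step has a measured effect attached to it.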
Summary
Checklist
- Algorithm class appropriate?
- Avoid large copies?
- Hot loops allocation-free after reserve?
- Cache-friendly traversal?
- Locks not dominating?
Priority
- Algorithmic improvements
- Remove copies / tighten interfaces
- Allocation reduction
- Data layout / cache
- Compiler flags last—after correctness and profiling
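For the data layout item, the following sketch contrasts AoS and SoA on a hypothetical particle type (the struct and field names are invented for this example). Both functions compute the same sum; they differ only in how much unrelated data each loop drags through the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// AoS: each particle's fields are adjacent, so a loop that only reads `x`
// still pulls y, z, and mass into cache lines.
struct ParticleAoS {
    float x, y, z, mass;
};

float sum_x_aos(const std::vector<ParticleAoS>& ps) {
    float sum = 0.0f;
    for (const ParticleAoS& p : ps) sum += p.x;
    return sum;
}

// SoA: one contiguous array per field; reading `x` touches only `x` data.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

float sum_x_soa(const ParticlesSoA& ps) {
    float sum = 0.0f;
    for (float v : ps.x) sum += v;
    return sum;
}
```

SoA tends to pay off when a hot loop touches one or two fields of many objects; when every field is used together, AoS is often just as good, so the layout change should follow a cache-miss measurement.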
Related posts (internal)
- Profiling deep dive
- Performance patterns
- Cache-friendly code
- Benchmarking
Practical tips
- Never optimize without a profile on realistic input.
- Compare before/after with fixed seeds and hardware when possible.
- Watch self time, not only inclusive time, to pick real hotspots.
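The fixed-seed tip can be sketched like this: generate benchmark input from a seeded engine so that before and after runs process identical data. The function name make_input and the choice of std::mt19937 are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Builds the same pseudo-random input for a given seed, so "before" and
// "after" measurements process identical data.
std::vector<std::uint32_t> make_input(std::size_t n, std::uint32_t seed) {
    std::mt19937 gen(seed);  // fixed seed instead of std::random_device
    std::vector<std::uint32_t> data;
    data.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        data.push_back(static_cast<std::uint32_t>(gen()));
    }
    return data;
}
```

Using the raw engine output keeps the sequence identical across standard library implementations; the distribution classes such as std::uniform_int_distribution are deterministic within one implementation but may differ between them.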
Closing
“Slow” becomes actionable when a profiler shows where time goes. Fix algorithm + data layout + allocations first; micro-optimize only on evidence. Next: Cache-friendly coding and SIMD articles when CPU-bound.