Why Is My C++ Program Slow? Find Bottlenecks with Profiling (perf, VS Profiler)

Key takeaways

Measure first: seven common slowdown causes, perf/VS workflows, ten fixable patterns (copies, allocations, AoS vs SoA), and a five-step optimization loop.

Complexity basics: arrays and lists (Big-O intuition alongside profiling).

Introduction: “The code looks correct but it’s slow”

“The Big-O complexity is the same, yet sometimes it’s slower than Python.”

When C++ feels slow, profiling turns guesses into hotspots: functions and lines that dominate time or hardware events.

This article covers:

  • Seven major causes of slowdown
  • Choosing a profiler
  • perf basics (Linux)
  • Visual Studio Profiler (Windows)
  • Ten common performance patterns
  • Case studies and a five-step tuning loop

Table of contents

  1. Seven root causes
  2. Profiler guide
  3. perf (Linux)
  4. Visual Studio Profiler
  5. Ten performance patterns
  6. Case studies
  7. Summary

1. Seven major causes (overview)

  1. Wrong asymptotics (e.g. nested loops vs hash set)
  2. Pass-by-value of large containers
  3. Excessive allocations inside hot loops
  4. Cache-unfriendly access patterns (stride, AoS vs SoA)
  5. Branch-heavy unpredictable control flow
  6. Virtual dispatch on hot inner loops
  7. Inefficient string building (repeated reallocations, excessive flushing)

Each cause comes with a small code example and fix below; the remedy is almost always the same loop: measure → change the data layout or algorithm → measure again.


2. Profiler guide

| Platform | Tool | Notes |
| --- | --- | --- |
| Linux | perf | Low overhead; call stacks + hardware counters |
| macOS | Instruments | Polished UI, Xcode integration |
| Windows | VS Profiler | Easy CPU sampling |
| Cross-platform | Valgrind/Callgrind | Much slower, but no recompilation needed |

3. perf (Linux)

perf record -g ./myapp                                # sample with call-graph capture
perf report                                           # interactive hotspot browser
perf stat -e cache-misses,cache-references ./myapp    # hardware-counter summary

Flame graphs: fold stacks with Brendan Gregg’s FlameGraph scripts for visual hotspots.
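A typical pipeline, assuming the FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) have been cloned from Brendan Gregg's repository and are on your PATH:

```shell
# Sample with call-graph capture (use --call-graph dwarf if frame pointers are omitted)
perf record -g ./myapp
# Fold the raw stacks and render an interactive SVG
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# Open flame.svg in a browser; the widest frames are your hotspots
```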


4. Visual Studio

Debug → Performance Profiler → CPU Usage — inspect exclusive vs inclusive time and call trees.


5. Ten patterns (titles)

  1. Pass const T& instead of T for large inputs.
  2. Reuse buffers / reserve vectors in loops.
  3. reserve / ostringstream for string assembly.
  4. Prefer unordered_map when average O(1) beats tree map.
  5. SoA for hot fields vs AoS when you touch only part of a struct.
  6. Reduce virtual calls in inner loops (batch by type, CRTP, etc.—design-dependent).
  7. Avoid std::endl in tight loops (forces flush); use '\n'.
  8. Compile regexes once, not per iteration.
  9. Reduce lock contention with local buffers then merge.
  10. Prefer contiguous vector<int> over unique_ptr per element when possible.

6. Case studies (short)

  • JSON-like string building: reserve cut reallocations → large speedups.
  • N+1 queries: one JOIN vs per-row queries → orders of magnitude.
  • Image filters: raw pixel pointer vs virtual getPixel per pixel → fewer calls.

Five-step process

  1. Measure end-to-end time + profiler trace
  2. Identify top exclusive-time functions
  3. Hypothesize (allocations? copies? cache?)
  4. Change one thing at a time
  5. Re-measure; repeat until goals met

Summary

Checklist

  • Algorithm class appropriate?
  • Avoid large copies?
  • Hot loops allocation-free after reserve?
  • Cache-friendly traversal?
  • Locks not dominating?

Priority

  1. Algorithmic improvements
  2. Remove copies / tighten interfaces
  3. Allocation reduction
  4. Data layout / cache
  5. Compiler flags last—after correctness and profiling

Related articles

  • Profiling deep dive
  • Performance patterns
  • Cache-friendly code
  • Benchmarking

Keywords

slow C++, profiling, perf, bottleneck, CPU profiler, cache miss

Practical tips

  • Never optimize without a profile on realistic input.
  • Compare before/after with fixed seeds and hardware when possible.
  • Watch self time, not only inclusive time, to pick real hotspots.

Closing

“Slow” becomes actionable when a profiler shows where time goes. Fix algorithm + data layout + allocations first; micro-optimize only on evidence.

Next: Cache-friendly coding and SIMD articles when CPU-bound.