C++ Profiling | Finding Bottlenecks with perf and gprof When You Don’t Know What’s Slow

Key takeaway

Practical C++ profiling: perf, gprof, flame graphs, chrono timers, and how to find real bottlenecks instead of guessing.

Introduction: “I don’t know what’s slow”

Real-world scenarios

What often happens:
- You spend three days optimizing a function you “thought” was slow; the real bottleneck was file I/O.
- perf report shows ??? for symbols and you can’t analyze.
- You profile with gprof but gmon.out never appears.
- Valgrind runs 30× slower and feels impractical.
- Your API server is at 100% CPU and you don’t know which handler is hot.
- After 24 hours, memory grows from 2GB to 8GB (possible leak).
- Your algorithm is O(n) but gets much slower than linear as n grows (cache effects suspected).

In these situations, measurement beats guessing. Use a profiler to find hotspots, visualize with flame graphs, optimize the top ~20% of time first—that usually gives the best return.

Optimizing from guesses wastes time

The program felt slow, so you optimized from intuition. The real bottleneck (the part that limits overall performance) was elsewhere.

Wrong approach:

// “This function must be slow” — optimize it
void processData(std::vector<int>& data) {
    // complex optimization attempts...
}

// In reality this was the bottleneck
void loadData() {
    // file I/O is slow
}

After profiling:

  • processData: ~5% of time
  • loadData: ~80% of time ← real bottleneck

Lessons:

  • Don’t guess—measure
  • Find bottlenecks with a profiler
  • Optimize the slowest parts first

Profiling means measuring which functions use how much CPU or memory at runtime. Without it, “this part feels slow” often points at the wrong layer—I/O or another module may dominate. Use CPU sampling (e.g. perf) or instrumentation first to see where time goes, then optimize the top few percent.

End-to-end profiling flow

flowchart TD
    A[Program is slow] --> B[Guess without measuring]
    B --> C{Hit the bottleneck?}
    C -->|No| D[Wasted time]
    A --> E[Run profiling]
    E --> F[Find hotspots]
    F --> G[Optimize top ~20%]
    G --> H[Re-measure]
    H --> I{Goal met?}
    I -->|No| E
    I -->|Yes| J[Done]

After reading this article you will:

  • Use profiling tools effectively
  • Pinpoint bottlenecks accurately
  • Measure performance quantitatively
  • Optimize effectively in practice

Table of contents

  1. What is profiling
  2. Basic timing
  3. Profiling tools
  4. Complete profiling example
  5. Bottleneck analysis
  6. Practical optimization process
  7. Common problems
  8. Checklist

1. What is profiling

Why measure performance

“Don’t guess—measure.”
- Intuition is often wrong
- Bottlenecks hide in unexpected places
- Optimization without measurement wastes time

Kinds of profiling

1. CPU profiling

  • Which functions use the most CPU
  • Call counts and time spent

2. Memory profiling

  • Memory usage
  • Allocation/deallocation counts
  • Leaks

3. Cache profiling

  • Cache miss counts
  • Access patterns

Profiling categories at a glance

flowchart LR
    subgraph CPU["CPU profiling"]
        C1[perf]
        C2[gprof]
        C3[VS Profiler]
    end
    subgraph MEM["Memory profiling"]
        M1[Valgrind Memcheck]
        M2[AddressSanitizer]
    end
    subgraph CACHE["Cache profiling"]
        K1[Valgrind Cachegrind]
        K2[perf stat]
    end

2. Basic timing

Measuring with std::chrono

Since C++11, std::chrono can measure intervals. Take high_resolution_clock::now() at start and end, subtract to get a duration, then duration_cast to milliseconds or microseconds. That turns “feels slow” into numbers.

// After pasting: g++ -std=c++17 -o profile_time profile_time.cpp && ./profile_time
#include <chrono>
#include <iostream>

void slowFunction() {
    // heavy work...
}

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    slowFunction();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Time: " << duration.count() << " ms\n";
    return 0;
}

Sample output: Time: N ms (N depends on the environment).

Details:

  • high_resolution_clock: finest clock available
  • now(): current time as time_point
  • duration_cast: convert e.g. to milliseconds
  • count(): integer value in that unit

RAII timer helper

Record time in the constructor and print elapsed time in the destructor—classic RAII timer. { Timer t("slowFunction"); slowFunction(); } prints when the scope ends. Exceptions and early returns still run the destructor, so you miss fewer “end times” than manual prints.

#include <chrono>
#include <iostream>

class Timer {
    std::chrono::high_resolution_clock::time_point start;
    const char* name;

public:
    Timer(const char* n) : name(n) {
        start = std::chrono::high_resolution_clock::now();
    }

    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        std::cout << name << ": " << duration.count() << " us\n";
    }
};

void processData() {
    Timer timer("processData");
    // work...
}  // prints automatically in destructor

Note: Keep the Timer in the right scope—use { } blocks so the measured region is clear.

Multiple sections

#include <chrono>
#include <iostream>
#include <map>
#include <string>

class Profiler {
    std::map<std::string, long long> timings;
    std::chrono::high_resolution_clock::time_point start;

public:
    void startTimer() {
        start = std::chrono::high_resolution_clock::now();
    }

    void record(const std::string& name) {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        timings[name] += duration.count();
        start = end;
    }

    void report() {
        for (const auto& [name, time] : timings) {
            std::cout << name << ": " << time << " us\n";
        }
    }
};

void loadData();     // defined elsewhere
void processData();  // defined elsewhere
void saveData();     // defined elsewhere

int main() {
    Profiler prof;

    prof.startTimer();
    loadData();
    prof.record("loadData");

    processData();
    prof.record("processData");

    saveData();
    prof.record("saveData");

    prof.report();
}

Usage: Each record() adds the time since the previous record() (or startTimer()). start = end advances to the next segment; repeat to accumulate totals.


3. Profiling tools

Choosing a tool

flowchart TD
    A[Need profiling] --> B{Platform?}
    B -->|Linux| C[perf]
    B -->|Linux/Mac| D[gprof]
    B -->|Linux/Mac| E[Valgrind]
    B -->|Windows| F[VS Profiler]
    C --> G[CPU sampling]
    D --> H[Instrumentation]
    E --> I[Memory/cache]
    F --> G

perf (Linux)

The standard Linux profiler. Sampling records which function is on-CPU periodically—low overhead, usable even in production-like settings.

# Profile while running
perf record ./myapp

# View results
perf report

# Per-function stats
perf stat ./myapp

Example output:

  50.23%  myapp  [.] processData
  30.45%  myapp  [.] loadFile
  15.32%  myapp  [.] parseJson

perf report tips:

# Include call graph
perf record -g ./myapp

# Text report
perf report --stdio

# Filter symbol
perf report --symbol-filter=processData

Interpreting perf stat:

 Performance counter stats for './myapp':

          1,234.56 msec task-clock
                42      context-switches
                 0      cpu-migrations
               128      page-faults
     3,456,789,012      cycles
     2,345,678,901      instructions

  • task-clock: CPU time (ms)
  • context-switches: context switch count
  • page-faults: page fault count
  • cycles, instructions: hardware counters

IPC (instructions per cycle) = instructions / cycles. Modern cores can retire several instructions per cycle, so an IPC around or above 1 usually means the CPU is doing useful work; an IPC well below 1 often points to memory stalls or branch mispredictions.

gprof (GNU profiler)

Compile with -pg to inject profiling code. Running produces gmon.out; gprof reports per-function time and call counts.

g++ -pg -O2 main.cpp -o myapp
./myapp
gprof myapp gmon.out

Sample gprof output:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 80.0      0.80     0.80        1   800.00   800.00  loadFile
 15.0      0.95     0.15      100     1.50     1.50  processData
  5.0      1.00     0.05        1    50.00    50.00  saveResult

Note: -pg with -O2 can inline and merge functions—use -O0/-O1 if you need clearer call relationships.

Valgrind Callgrind

Simulates execution step by step—accurate call graphs and cache info, but 10–50× slower—use on short runs only.

valgrind --tool=callgrind ./myapp
callgrind_annotate callgrind.out.12345
# GUI: kcachegrind

Options:

valgrind --tool=callgrind --cache-sim=yes ./myapp
valgrind --tool=callgrind --toggle-collect=processData ./myapp

Visual Studio Profiler

1. Debug → Performance Profiler
2. CPU Usage
3. Start, run app
4. Inspect Hot Path and per-function time

Tool comparison

| Tool        | Platform  | Method          | Overhead           | Production |
| ----------- | --------- | --------------- | ------------------ | ---------- |
| perf        | Linux     | Sampling        | Low (~5%)          | Yes        |
| gprof       | Linux/Mac | Instrumentation | Medium (~10%)      | Sometimes  |
| Valgrind    | Linux/Mac | Simulation      | Very high (10–50×) | No         |
| VS Profiler | Windows   | Sampling        | Low                | Yes        |

Flame graphs

Flame graphs stack frames bottom-up; width shows share of CPU time—great for spotting hot paths.

perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

How to read:

  • Width: fraction of CPU time on that path
  • Height: call stack (caller below, callee above)
  • Wide bars: hottest paths

Full flame graph workflow:

git clone --depth 1 https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$(pwd)/FlameGraph"

perf record -F 99 -g --call-graph dwarf,8192 ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
open flamegraph.svg   # macOS
# xdg-open flamegraph.svg  # Linux

Common patterns:

| Pattern                 | Meaning         | Action                                          |
| ----------------------- | --------------- | ----------------------------------------------- |
| Wide memcpy             | Copy-bound      | Buffer pools, zero-copy                         |
| Wide malloc/free        | Allocation cost | Pools, arenas                                   |
| Wide std::sort          | Sort cost       | Avoid sort, partial sort                        |
| Wide pthread_mutex_lock | Lock wait       | Smaller critical sections, lock-free where safe |

4. Complete profiling example

Target program

// profile_target.cpp — analyze with perf, gprof
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iostream>

void processDataCacheUnfriendly(std::vector<int>& data) {
    // A stride of 16 ints (64 bytes) lands on a new cache line each
    // iteration, so every loaded line yields only one useful element
    const size_t stride = 16;
    for (size_t i = 0; i < data.size(); i += stride) {
        data[i] = data[i] * 2 + 1;
    }
}

void processDataCacheFriendly(std::vector<int>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2 + 1;
    }
}

void sortData(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}

void fillRandom(std::vector<int>& data) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000000);
    for (auto& v : data) {
        v = dis(gen);
    }
}

int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);

    fillRandom(data);
    sortData(data);
    processDataCacheUnfriendly(data);
    processDataCacheFriendly(data);

    return 0;
}

perf example

g++ -std=c++17 -O2 -g -o profile_target profile_target.cpp
perf record -F 99 -g --call-graph dwarf,8192 ./profile_target
perf report --stdio
perf stat -e cycles,instructions,cache-references,cache-misses ./profile_target

Sample perf report --stdio:

#   45.23%  profile_target    [.] sortData
#   28.10%  profile_target    [.] fillRandom
#   12.30%  profile_target    [.] processDataCacheUnfriendly
#   10.00%  profile_target    [.] processDataCacheFriendly

Hotspot: sortData ~45% → consider algorithm changes or removing sort.

gprof example

g++ -std=c++17 -O2 -pg -g -o profile_target_gprof profile_target.cpp
./profile_target_gprof
gprof -p profile_target_gprof gmon.out
gprof -q profile_target_gprof gmon.out
gprof profile_target_gprof gmon.out > gprof_report.txt

Reading gprof: focus on % time, self seconds, calls.

Hotspot workflow

flowchart TD
    A[Run program] --> B[perf record -g]
    B --> C[perf report]
    C --> D{Top 3 functions?}
    D --> E[Widest bar = bottleneck]
    E --> F[Refine with Timer]
    F --> G[Pick optimization target]
    G --> H[Re-measure after fix]

5. Bottleneck analysis

Finding hotspots

// Profiling says:
// 80% - loadFile()      ← bottleneck!
// 15% - processData()
// 5%  - saveResult()

void loadFile(const std::string& path) {
    Timer timer("loadFile");
    std::ifstream file;
    { Timer t("open"); file.open(path); }
    { Timer t("read"); /* read... slow here */ }
    { Timer t("parse"); /* parse... */ }
}

Call counts (simple instrumentation)

#include <iostream>
#include <map>
#include <string>

class CallCounter {
    static std::map<std::string, int> counts;
    std::string name;

public:
    CallCounter(const char* n) : name(n) {
        counts[name]++;
    }

    static void report() {
        for (const auto& [name, count] : counts) {
            std::cout << name << ": " << count << " calls\n";
        }
    }
};

std::map<std::string, int> CallCounter::counts;

Pareto (80/20)

~80% of runtime often comes from the top ~20% of functions.
Optimizing those first yields most of the win.

6. Practical optimization process

  1. Measure baseline (chrono, benchmarks)
  2. Profile (perf record -g, etc.)
  3. Optimize the real hotspot (e.g. reserve for vectors)
  4. Re-measure
  5. Repeat

Benchmarking tips

  • Warm up caches before timing
  • Run multiple iterations and average or take median
  • Use -O2/-O3 for release-like numbers when that matches production

Memory profiling

valgrind --leak-check=full ./myapp

AddressSanitizer (faster than Valgrind for many bugs):

g++ -g -O1 -fsanitize=address -fno-omit-frame-pointer main.cpp -o myapp
./myapp

7. Common problems

perf permission denied

Lower kernel.perf_event_paranoid or run with appropriate privileges (see your distro docs).

No gmon.out

Ensure the binary was built with -pg and that the process exits normally: gmon.out is written at exit, so killing the program with Ctrl+C or abort() can prevent it from ever appearing.

Valgrind too slow

Use smaller inputs, or use perf for CPU-only work.

Symbols show as ???

Build with -g, avoid stripping debug info.

Inlined functions disappear from profile

Try -O1/-O0 for profiling builds, or mark critical functions __attribute__((noinline)).


8. Checklist

Before profiling

  • -g for symbols
  • Choose optimization level (-O1 often balances accuracy vs reality)
  • perf: check perf_event_paranoid
  • gprof: -pg
  • Valgrind: shrink workload

After profiling

  • Identify top ~20% functions
  • Drill down with timers
  • Record baseline before changes
  • Re-measure after changes
  • Regression-test behavior

Principles

  • Measure, don’t guess
  • Fix big bottlenecks first
  • Compare before/after
  • Use profilers systematically

Related articles:

  • Cache-friendly C++
  • Compile-time optimization
  • Compiler optimization PGO/LTO

Tags: C++ profiling, perf, gprof, Valgrind, bottleneck, performance measurement, optimization, flame graph

Summary

| Tool        | Platform  | Role              |
| ----------- | --------- | ----------------- |
| perf        | Linux     | CPU sampling      |
| gprof       | Linux/Mac | Per-function time |
| Valgrind    | Linux/Mac | Memory, cache     |
| VS Profiler | Windows   | CPU, memory       |
| std::chrono | All       | Manual timing     |

Principles: measure first; optimize hotspots; compare before/after; use profilers.

Practical tips

Debugging

  • Fix compiler warnings first
  • Reproduce with a small test case

Performance

  • Don’t optimize without profiling
  • Define measurable goals

Code review

  • Check common review feedback early
  • Follow team conventions

FAQ

When is this useful in practice?

A. Finding bottlenecks with perf/gprof/Valgrind, measuring performance, and choosing what to optimize—use the article’s workflows and examples.

perf vs gprof?

A. On Linux, prefer perf (sampling, low overhead). gprof needs -pg and rebuilds. For exact call graphs on short runs, consider Callgrind.

Does profiling slow the app?

A. perf sampling is usually ~5% overhead. gprof instrumentation is higher. Valgrind is 10–50×—short runs only.

What to read first?

A. Follow previous-post links or the C++ series index.

Go deeper?

A. See cppreference and official tool documentation.

One-line summary: Use chrono and profilers to find real hotspots, then optimize. Next: cache-friendly code (#15-2).

Next: C++ practical guide #15-2: cache-friendly code

Previous: Perfect forwarding (#14-2)


Related posts:

  • Cache optimization
  • Compile-time optimization
  • Slow program causes
  • Advanced profiling
  • STL algorithms basics