C++ Profiling | Finding Bottlenecks with perf and gprof When You Don’t Know What’s Slow

Key takeaway

Practical C++ profiling: perf, gprof, flame graphs, chrono timers, and how to find real bottlenecks instead of guessing.

Introduction: “I don’t know what’s slow”

Real-world scenarios

What often happens:
- You spend three days optimizing a function you “thought” was slow; the real bottleneck was file I/O.
- perf report shows ??? for symbols and you can’t analyze.
- You profile with gprof but gmon.out never appears.
- Valgrind runs 30× slower and feels impractical.
- Your API server is at 100% CPU and you don’t know which handler is hot.
- After 24 hours, memory grows from 2GB to 8GB (possible leak).
- Your algorithm is O(n) but gets much slower than linear as n grows (cache effects suspected).

In these situations, measurement beats guessing. Use a profiler to find hotspots, visualize with flame graphs, optimize the top ~20% of time first—that usually gives the best return.

Optimizing from guesses wastes time

The program felt slow, so you optimized from intuition. The real bottleneck (the part that limits overall performance) was elsewhere.

Wrong approach:

// “This function must be slow” — optimize it
void processData(std::vector<int>& data) {
    // complex optimization attempts...
}

// In reality this was the bottleneck
void loadData() {
    // file I/O is slow
}

After profiling:

  • processData: ~5% of time
  • loadData: ~80% of time ← real bottleneck

Lessons:

  • Don’t guess—measure
  • Find bottlenecks with a profiler
  • Optimize the slowest parts first

Profiling means measuring which functions use how much CPU or memory at runtime. Without it, “this part feels slow” often points at the wrong layer—I/O or another module may dominate. Use CPU sampling (e.g. perf) or instrumentation first to see where time goes, then optimize the top few percent.

End-to-end profiling flow

flowchart TD
    A[Program is slow] --> B[Guess without measuring]
    B --> C{Hit the bottleneck?}
    C -->|No| D[Wasted time]
    A --> E[Run profiling]
    E --> F[Find hotspots]
    F --> G[Optimize top ~20%]
    G --> H[Re-measure]
    H --> I{Goal met?}
    I -->|No| E
    I -->|Yes| J[Done]

After reading this article you will:

  • Use profiling tools effectively
  • Pinpoint bottlenecks accurately
  • Measure performance quantitatively
  • Optimize effectively in practice

Table of contents

  1. What is profiling
  2. Basic timing
  3. Profiling tools
  4. Complete profiling example
  5. Bottleneck analysis
  6. Practical optimization process
  7. Common problems
  8. Checklist

1. What is profiling

Why measure performance

“Don’t guess—measure.”
- Intuition is often wrong
- Bottlenecks hide in unexpected places
- Optimization without measurement wastes time

Kinds of profiling

1. CPU profiling

  • Which functions use the most CPU
  • Call counts and time spent

2. Memory profiling

  • Memory usage
  • Allocation/deallocation counts
  • Leaks

3. Cache profiling

  • Cache miss counts
  • Access patterns

Profiling categories at a glance

flowchart LR
    subgraph CPU["CPU profiling"]
        C1[perf]
        C2[gprof]
        C3[VS Profiler]
    end
    subgraph MEM["Memory profiling"]
        M1[Valgrind Memcheck]
        M2[AddressSanitizer]
    end
    subgraph CACHE["Cache profiling"]
        K1[Valgrind Cachegrind]
        K2[perf stat]
    end

2. Basic timing

Measuring with std::chrono

Since C++11, std::chrono can measure intervals. Take high_resolution_clock::now() at start and end, subtract to get a duration, then duration_cast to milliseconds or microseconds. That turns “feels slow” into numbers.

// After pasting: g++ -std=c++17 -o profile_time profile_time.cpp && ./profile_time
#include <chrono>
#include <iostream>

void slowFunction() {
    // heavy work...
}

int main() {
    auto start = std::chrono::high_resolution_clock::now();
    slowFunction();
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << "Time: " << duration.count() << " ms\n";
    return 0;
}

Sample output: Time: N ms (N depends on the environment).

Details:

  • high_resolution_clock: finest clock available
  • now(): current time as time_point
  • duration_cast: convert e.g. to milliseconds
  • count(): integer value in that unit

RAII timer helper

Record time in the constructor and print elapsed time in the destructor—classic RAII timer. { Timer t("slowFunction"); slowFunction(); } prints when the scope ends. Exceptions and early returns still run the destructor, so you miss fewer “end times” than manual prints.

#include <chrono>
#include <iostream>

class Timer {
    std::chrono::high_resolution_clock::time_point start;
    const char* name;

public:
    Timer(const char* n) : name(n) {
        start = std::chrono::high_resolution_clock::now();
    }

    ~Timer() {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        std::cout << name << ": " << duration.count() << " us\n";
    }
};

void processData() {
    Timer timer("processData");
    // work...
}  // prints automatically in destructor

Note: Keep the Timer in the right scope—use { } blocks so the measured region is clear.

Multiple sections

#include <chrono>
#include <iostream>
#include <map>
#include <string>

class Profiler {
    std::map<std::string, long long> timings;
    std::chrono::high_resolution_clock::time_point start;

public:
    void startTimer() {
        start = std::chrono::high_resolution_clock::now();
    }

    void record(const std::string& name) {
        auto end = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
        timings[name] += duration.count();
        start = end;
    }

    void report() {
        for (const auto& [name, time] : timings) {
            std::cout << name << ": " << time << " us\n";
        }
    }
};

void loadData();     // defined elsewhere
void processData();  // defined elsewhere
void saveData();     // defined elsewhere

int main() {
    Profiler prof;

    prof.startTimer();
    loadData();
    prof.record("loadData");

    processData();
    prof.record("processData");

    saveData();
    prof.record("saveData");

    prof.report();
}

Usage: Each record() adds the time since the previous record() (or startTimer()). start = end advances to the next segment; repeat to accumulate totals.


3. Profiling tools

Choosing a tool

flowchart TD
    A[Need profiling] --> B{Platform?}
    B -->|Linux| C[perf]
    B -->|Linux/Mac| D[gprof]
    B -->|Linux/Mac| E[Valgrind]
    B -->|Windows| F[VS Profiler]
    C --> G[CPU sampling]
    D --> H[Instrumentation]
    E --> I[Memory/cache]
    F --> G

perf (Linux)

The standard Linux profiler. Sampling records which function is on-CPU periodically—low overhead, usable even in production-like settings.

# Profile while running
perf record ./myapp

# View results
perf report

# Per-function stats
perf stat ./myapp

Example output:

  50.23%  myapp  [.] processData
  30.45%  myapp  [.] loadFile
  15.32%  myapp  [.] parseJson

perf report tips:

# Include call graph
perf record -g ./myapp

# Text report
perf report --stdio

# Filter symbol
perf report --symbol-filter=processData

Interpreting perf stat:

 Performance counter stats for './myapp':

          1,234.56 msec task-clock
                42      context-switches
                 0      cpu-migrations
               128      page-faults
     3,456,789,012      cycles
     2,345,678,901      instructions

  • task-clock: CPU time (ms)
  • context-switches: context switch count
  • page-faults: page fault count
  • cycles, instructions: hardware counters

IPC (instructions per cycle) = instructions / cycles. Modern cores can retire several instructions per cycle, so an IPC around or above 1 usually means the CPU is doing useful work; an IPC well below 1 often points to memory stalls or branch mispredictions.

gprof (GNU profiler)

Compile with -pg to inject profiling code. Running produces gmon.out; gprof reports per-function time and call counts.

g++ -pg -O2 main.cpp -o myapp
./myapp
gprof myapp gmon.out

Sample gprof output:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 80.0      0.80     0.80        1   800.00   800.00  loadFile
 15.0      0.95     0.15      100     1.50     1.50  processData
  5.0      1.00     0.05        1    50.00    50.00  saveResult

Note: -pg with -O2 can inline and merge functions—use -O0/-O1 if you need clearer call relationships.

Valgrind Callgrind

Simulates execution step by step—accurate call graphs and cache info, but 10–50× slower—use on short runs only.

valgrind --tool=callgrind ./myapp
callgrind_annotate callgrind.out.12345
# GUI: kcachegrind

Options:

valgrind --tool=callgrind --cache-sim=yes ./myapp
valgrind --tool=callgrind --toggle-collect=processData ./myapp

Visual Studio Profiler

1. Debug → Performance Profiler
2. CPU Usage
3. Start, run app
4. Inspect Hot Path and per-function time

Tool comparison

| Tool        | Platform  | Method          | Overhead           | Production |
| ----------- | --------- | --------------- | ------------------ | ---------- |
| perf        | Linux     | Sampling        | Low (~5%)          | Yes        |
| gprof       | Linux/Mac | Instrumentation | Medium (~10%)      | Sometimes  |
| Valgrind    | Linux/Mac | Simulation      | Very high (10–50×) | No         |
| VS Profiler | Windows   | Sampling        | Low                | Yes        |

Flame graphs

Flame graphs stack frames bottom-up; width shows share of CPU time—great for spotting hot paths.

perf record -F 99 -g ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

How to read:

  • Width: fraction of CPU time on that path
  • Height: call stack (caller below, callee above)
  • Wide bars: hottest paths

Full flame graph workflow:

git clone --depth 1 https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$(pwd)/FlameGraph"

perf record -F 99 -g --call-graph dwarf,8192 ./myapp
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
open flamegraph.svg   # macOS
# xdg-open flamegraph.svg  # Linux

Common patterns:

| Pattern                 | Meaning         | Action                                          |
| ----------------------- | --------------- | ----------------------------------------------- |
| Wide memcpy             | Copy-bound      | Buffer pools, zero-copy                         |
| Wide malloc/free        | Allocation cost | Pools, arenas                                   |
| Wide std::sort          | Sort cost       | Avoid sort, partial sort                        |
| Wide pthread_mutex_lock | Lock wait       | Smaller critical sections, lock-free where safe |

4. Complete profiling example

Target program

// profile_target.cpp — analyze with perf, gprof
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iostream>

void processDataCacheUnfriendly(std::vector<int>& data) {
    // A stride of 16 ints (64 bytes) lands on a new cache line each
    // iteration, so every loaded line yields only one useful element
    const size_t stride = 16;
    for (size_t i = 0; i < data.size(); i += stride) {
        data[i] = data[i] * 2 + 1;
    }
}

void processDataCacheFriendly(std::vector<int>& data) {
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2 + 1;
    }
}

void sortData(std::vector<int>& data) {
    std::sort(data.begin(), data.end());
}

void fillRandom(std::vector<int>& data) {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000000);
    for (auto& v : data) {
        v = dis(gen);
    }
}

int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);

    fillRandom(data);
    sortData(data);
    processDataCacheUnfriendly(data);
    processDataCacheFriendly(data);

    return 0;
}

perf example

g++ -std=c++17 -O2 -g -o profile_target profile_target.cpp
perf record -F 99 -g --call-graph dwarf,8192 ./profile_target
perf report --stdio
perf stat -e cycles,instructions,cache-references,cache-misses ./profile_target

Sample perf report --stdio:

#   45.23%  profile_target    [.] sortData
#   28.10%  profile_target    [.] fillRandom
#   12.30%  profile_target    [.] processDataCacheUnfriendly
#   10.00%  profile_target    [.] processDataCacheFriendly

Hotspot: sortData ~45% → consider algorithm changes or removing sort.

gprof example

g++ -std=c++17 -O2 -pg -g -o profile_target_gprof profile_target.cpp
./profile_target_gprof
gprof -p profile_target_gprof gmon.out
gprof -q profile_target_gprof gmon.out
gprof profile_target_gprof gmon.out > gprof_report.txt

Reading gprof: focus on % time, self seconds, calls.

Hotspot workflow

flowchart TD
    A[Run program] --> B[perf record -g]
    B --> C[perf report]
    C --> D{Top 3 functions?}
    D --> E[Widest bar = bottleneck]
    E --> F[Refine with Timer]
    F --> G[Pick optimization target]
    G --> H[Re-measure after fix]

5. Bottleneck analysis

Finding hotspots

// Profiling says:
// 80% - loadFile()      ← bottleneck!
// 15% - processData()
// 5%  - saveResult()

void loadFile(const std::string& path) {
    Timer timer("loadFile");
    std::ifstream file;
    { Timer t("open"); file.open(path); }
    { Timer t("read"); /* read... slow here */ }
    { Timer t("parse"); /* parse... */ }
}

Call counts (simple instrumentation)

#include <iostream>
#include <map>
#include <string>

class CallCounter {
    static std::map<std::string, int> counts;
    std::string name;

public:
    CallCounter(const char* n) : name(n) {
        counts[name]++;
    }

    static void report() {
        for (const auto& [name, count] : counts) {
            std::cout << name << ": " << count << " calls\n";
        }
    }
};

std::map<std::string, int> CallCounter::counts;

Pareto (80/20)

~80% of runtime often comes from the top ~20% of functions.
Optimizing those first yields most of the win.

6. Practical optimization process

  1. Measure baseline (chrono, benchmarks)
  2. Profile (perf record -g, etc.)
  3. Optimize the real hotspot (e.g. reserve for vectors)
  4. Re-measure
  5. Repeat

Benchmarking tips

  • Warm up caches before timing
  • Run multiple iterations and average or take median
  • Use -O2/-O3 for release-like numbers when that matches production

Memory profiling

valgrind --leak-check=full ./myapp

AddressSanitizer (faster than Valgrind for many bugs):

g++ -g -O1 -fsanitize=address -fno-omit-frame-pointer main.cpp -o myapp
./myapp

7. Common problems

perf permission denied

Lower kernel.perf_event_paranoid or run with appropriate privileges (see your distro docs).

No gmon.out

Ensure the binary was built with -pg and that the process exits normally: gmon.out is written at exit, so killing the program with Ctrl+C or abort() can prevent it from ever appearing.

Valgrind too slow

Use smaller inputs, or use perf for CPU-only work.

Symbols show as ???

Build with -g, avoid stripping debug info.

Inlined functions disappear from profile

Try -O1/-O0 for profiling builds, or mark critical functions __attribute__((noinline)).


8. Checklist

Before profiling

  • -g for symbols
  • Choose optimization level (-O1 often balances accuracy vs reality)
  • perf: check perf_event_paranoid
  • gprof: -pg
  • Valgrind: shrink workload

After profiling

  • Identify top ~20% functions
  • Drill down with timers
  • Record baseline before changes
  • Re-measure after changes
  • Regression-test behavior

Principles

  • Measure, don’t guess
  • Fix big bottlenecks first
  • Compare before/after
  • Use profilers systematically

Related articles:

  • Cache-friendly C++
  • Compile-time optimization
  • Compiler optimization PGO/LTO

Tags: C++ profiling, perf, gprof, Valgrind, bottleneck, performance measurement, optimization, flame graph

Summary

| Tool        | Platform  | Role              |
| ----------- | --------- | ----------------- |
| perf        | Linux     | CPU sampling      |
| gprof       | Linux/Mac | Per-function time |
| Valgrind    | Linux/Mac | Memory, cache     |
| VS Profiler | Windows   | CPU, memory       |
| std::chrono | All       | Manual timing     |

Principles: measure first; optimize hotspots; compare before/after; use profilers.

Practical tips

Debugging

  • Fix compiler warnings first
  • Reproduce with a small test case

Performance

  • Don’t optimize without profiling
  • Define measurable goals

Code review

  • Check common review feedback early
  • Follow team conventions

FAQ

When is this useful in practice?

A. Finding bottlenecks with perf/gprof/Valgrind, measuring performance, and choosing what to optimize—use the article’s workflows and examples.

perf vs gprof?

A. On Linux, prefer perf (sampling, low overhead). gprof needs -pg and rebuilds. For exact call graphs on short runs, consider Callgrind.

Does profiling slow the app?

A. perf sampling is usually ~5% overhead. gprof instrumentation is higher. Valgrind is 10–50×—short runs only.

What to read first?

A. Follow previous-post links or the C++ series index.

Go deeper?

A. See cppreference and official tool documentation.

One-line summary: Use chrono and profilers to find real hotspots, then optimize. Next: cache-friendly code (#15-2).

Next: C++ practical guide #15-2: cache-friendly code

Previous: Perfect forwarding (#14-2)


Related posts:

  • Cache optimization
  • Compile-time optimization
  • Slow program causes
  • Advanced profiling
  • STL algorithms basics