[2026] C++ Advanced Profiling Guide | perf, gprof, Valgrind, VTune, Tracy [#51-1]

Key takeaways

When your multithreaded C++ game server burns 60% CPU and you cannot find the bottleneck: master perf, gprof, Valgrind (Callgrind, Cachegrind, Memcheck), VTune, Tracy, flame graphs, and cache analysis with practical commands and benchmarks.

Introduction: “Our multithreaded server uses 60% CPU and we do not know where”


Problem scenarios

Situations you actually hit:
- Game server uses 5 of 8 cores at 100%, but you do not know which function is hot
- perf report shows ??? for symbols, so you cannot analyze
- People say “lots of cache misses” but you do not know how to measure
- You want per-frame latency in real time, but gprof cannot do that
- You suspect a memory leak but cannot trace where it comes from
- gprof’s call graph is said to be inaccurate
- Valgrind runs 30× slower and feels impractical

More scenarios: API server at 100% CPU with unknown handler; memory grows 2 GB → 8 GB over 24 hours (leak?); O(n) work gets slower than linear as n grows (suspect cache).

Beyond basic profiling (cpp-series-15-1):

  • Advanced perf: flame graphs, cache events, reading stacks
  • Intel VTune: CPU pipeline, memory bandwidth, thread synchronization
  • Tracy: real-time frame profiling tuned for games and interactive apps

After reading this article you will be able to:

  • Build flame graphs with perf and see bottlenecks visually
  • Use gprof for call graphs and flat profiles (limits included)
  • Use Valgrind (Callgrind, Cachegrind, Memcheck) for memory and cache work
  • Quantify cache misses and branch mispredictions with VTune
  • Monitor per-frame latency in real time with Tracy
  • Apply safer sampling patterns in production

Expected environment: C++17 or newer, Linux (perf), Intel CPU (VTune), CMake (Tracy)


Experience from real projects: this article is based on real bottlenecks and fixes from large C++ codebases, including pitfalls and debugging tips you rarely see in textbooks.

Table of contents

  1. Problem scenarios and tool choice
  2. Advanced perf: flame graphs and cache profiling
  3. gprof: call graph and flat profile
  4. Valgrind: Callgrind, Cachegrind, Memcheck
  5. Intel VTune: CPU pipeline analysis
  6. Tracy: real-time profiler
  7. Full benchmark example
  8. How to read flame graphs
  9. Common issues and fixes
  10. Profiling benchmark comparison
  11. Profiling best practices
  12. Production profiling patterns
  13. Checklists

1. Problem scenarios and tool choice

When to use which tool?

flowchart TD
    A[Performance issue] --> B{Type?}
    B -->|CPU bottleneck| C{Environment?}
    B -->|Memory leak / errors| D[Valgrind Memcheck]
    B -->|Cache efficiency| E[Valgrind Cachegrind]
    C -->|Linux server| F{Intel CPU?}
    C -->|Game / real-time app| G[Tracy]
    F -->|Yes| H{Deep dive?}
    F -->|No / AMD| I[perf]
    H -->|Yes: cache / pipeline| J[Intel VTune]
    H -->|No| I
    I --> K[Flame graph]
    G --> L[Real-time timeline]

Tool comparison

| Tool | Overhead | Production | Strengths | Weaknesses |
|---|---|---|---|---|
| perf | 1–5% | ✅ Often OK | Free, standard on Linux, flame graphs | Some events limited on AMD |
| gprof | 5–15% | △ Sometimes | Call graph, easy to enable | Inaccurate sampling, ignores inlining |
| Valgrind | 10–50× slower | ❌ No | Leaks, cache simulation | Very slow; short runs only |
| VTune | 5–15% | △ Staging | Deep cache/pipeline | Intel-only, commercial |
| Tracy | 0.1–1% | △ Optional | Real time, per frame | Requires code changes |

Profiling workflow

sequenceDiagram
    participant Dev as Developer
    participant Perf as perf
    participant Valgrind as Valgrind
    participant VTune as VTune
    participant Tracy as Tracy
    Dev->>Perf: 1. perf record (quick hotspot search)
    Perf->>Dev: Flame graph, top functions
    Dev->>Valgrind: 2. Memcheck (if leak suspected)
    Valgrind->>Dev: Leak sites, bad accesses
    Dev->>VTune: 3. VTune (if cache/pipeline suspected)
    VTune->>Dev: Cache miss, branch prediction reports
    Dev->>Tracy: 4. Tracy (real-time frame analysis)
    Tracy->>Dev: Per-frame latency timeline

2. Advanced perf: flame graphs and cache profiling

Advanced perf record options

# Sampling rate: 99 Hz — common convention for hotspot hunting (odd rate avoids lockstep with periodic timers)
perf record -F 99 -g ./myapp
# 999 Hz: finer sampling (more overhead)
perf record -F 999 -g ./myapp
# DWARF unwinding with an explicit stack dump size in bytes (tune if stacks come out truncated)
perf record -F 99 --call-graph dwarf,4096 ./myapp
# Event: cache misses
perf record -e cache-misses -F 99 -g ./myapp
# Only CPUs 0 and 1 (useful for multithreaded apps)
perf record -C 0,1 -F 99 -g ./myapp

Option notes:

  • -F 99: 99 samples per second → low overhead, usually enough for hotspots
  • -g: collect stacks (required for flame graphs)
  • --call-graph dwarf: unwind stacks with DWARF (more accurate)
  • -e cache-misses: generic cache-miss event (on most CPUs this maps to last-level cache misses)

Building a flame graph (end-to-end)

# 1. Collect perf data while the app runs
perf record -F 99 -g -- ./myapp
# 2. Install FlameGraph once
git clone https://github.com/brendangregg/FlameGraph
export PATH=$PATH:$(pwd)/FlameGraph
# 3. Build SVG flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
# 4. Open in a browser
open flamegraph.svg   # macOS
xdg-open flamegraph.svg  # Linux

perf stat: hardware counters

# Default stats
perf stat ./myapp
# Detailed cache counters
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./myapp
# Repeat runs for averages
perf stat -r 5 ./myapp

Reading the output:

 Performance counter stats for './myapp' (5 runs):
       1,234.56 msec task-clock                # CPU time
              42      context-switches        # Many → thread switching cost
               0      cpu-migrations
             128      page-faults
   3,456,789,012      cycles
   2,345,678,901      instructions            # 0.68 insn per cycle
     123,456,789      cache-references
      12,345,678      cache-misses            # 10.0% miss rate!

Key metrics:

  • IPC (instructions per cycle): instructions / cycles — above ~1.0 is generally good; below ~0.5 often means memory-bound.
  • Cache miss rate: cache-misses / cache-references — above ~10% suggests revisiting access patterns.
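As a sanity check, the two ratios can be computed directly from the counter values in the sample output above (a small illustrative helper; the thresholds quoted are rules of thumb, not hard limits):

```cpp
#include <cstdint>

// IPC = instructions / cycles; values well below ~1.0 often indicate a
// memory-bound workload on modern cores.
double ipc(uint64_t instructions, uint64_t cycles) {
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Miss rate = cache-misses / cache-references; above ~10% is worth a look.
double miss_rate(uint64_t misses, uint64_t references) {
    return static_cast<double>(misses) / static_cast<double>(references);
}
```

With the sample counters, ipc(2345678901, 3456789012) is about 0.68 and miss_rate(12345678, 123456789) is about 0.10 — matching the annotations in the output.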

perf annotate: line-level hotspots

perf annotate -s processData
# After perf record
perf report
# Press 'a' for annotate, 's' for symbol sort

3. gprof: call graph and flat profile

What gprof is

gprof is the classic GNU profiler (part of binutils, used alongside GCC). Compile with -pg, run the program to produce gmon.out, then inspect per-function CPU share (flat profile) and the call graph. It is useful on legacy systems without perf, but inlined functions and shared libraries can skew results.

Full gprof workflow

# 1. Compile with -pg (can combine with optimization)
g++ -std=c++17 -O2 -pg -g -o myapp profile_target.cpp
# 2. Run (writes gmon.out)
./myapp
# 3. Flat profile (time share per function)
gprof myapp gmon.out
# 4. Call graph only
gprof -q myapp gmon.out
# 5. Flat profile only (no graph)
gprof -p myapp gmon.out
# 6. Save report to a file
gprof myapp gmon.out > gprof_report.txt

Reading gprof output

Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 45.23      2.15     2.15        1  2150.00  2150.00  sortData
 28.10      3.48     1.33        1  1330.00  1330.00  fillRandom
 12.30      4.07     0.59        1   590.00   590.00  processDataCacheUnfriendly
 10.00      4.54     0.48        1   480.00   480.00  processDataCacheFriendly

Important columns:

  • % time: fraction of total time in that function
  • self seconds: time spent in the function body
  • calls: number of invocations
  • total: time including callees

Call graph example

index % time    self  children    called     name
[1]    100.0    0.00    4.54                 main [1]
                2.15    0.00       1/1           sortData [2]
                1.33    0.00       1/1           fillRandom [3]
[2]     47.4    2.15    0.00       1         sortData [2]

Limits and alternatives

| Limit | What happens | Alternative |
|---|---|---|
| Inlining ignored | With -O2, inlined cost rolls into parents | perf with DWARF stacks |
| Shared libraries | .so internals can be fuzzy | perf, VTune |
| Multithreading | Not split per thread | perf -C, VTune threading |
| Fixed sampling | Short functions may be missed | Tune perf -F |

4. Valgrind: Callgrind, Cachegrind, Memcheck

What Valgrind is

Valgrind uses dynamic binary instrumentation: it runs your program on a synthetic CPU to analyze memory, cache, and calls in detail. Expect 10–50× slowdown — use short runs or unit tests.

Valgrind tools compared

| Tool | Role | Output |
|---|---|---|
| Callgrind | CPU profiling, call counts | callgrind.out.*, visualize in KCachegrind |
| Cachegrind | L1 and LL (last-level) miss simulation | Cache statistics |
| Memcheck | Leaks, invalid access | Reports with file:line |

Callgrind: CPU profiling

valgrind --tool=callgrind ./myapp
# Output: callgrind.out.<pid>
# qcachegrind callgrind.out.12345
callgrind_annotate callgrind.out.12345
callgrind_annotate --inclusive=yes callgrind.out.12345 | head -80

Reading output: callgrind_annotate shows instructions retired (Ir) per function — the top entries are usual suspects.

Cachegrind: cache misses

valgrind --tool=cachegrind ./myapp
# Example lines:
# ==12345== D1  misses:      12,345,678  ( 10.2% of all refs)
# ==12345== LL misses:        1,234,567  (  1.0% of all refs)

Interpretation: high D1 misses (L1 data) or LL misses (last-level → DRAM) mean you should improve locality or layout.
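To make the locality point concrete, here is a sketch contrasting sequential (row-major) and strided (column-major) traversal of the same contiguous matrix; run each under Cachegrind and compare D1 miss rates (illustrative function names, not from the article's benchmark):

```cpp
#include <cstddef>
#include <vector>

// Matrix stored row-major in one contiguous vector of rows * cols ints.

// Sequential walk: consecutive addresses, cache lines fully used.
long long sum_row_major(const std::vector<int>& m, size_t rows, size_t cols) {
    long long s = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

// Strided walk: jumps `cols` ints per step, so each access may touch a
// new cache line — D1 misses climb as `cols` grows.
long long sum_col_major(const std::vector<int>& m, size_t rows, size_t cols) {
    long long s = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
```

Both functions return the same sum; only the access order — and therefore the simulated miss rate — differs.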

Memcheck: leaks and memory errors

valgrind --tool=memcheck --leak-check=full ./myapp
valgrind --tool=memcheck --leak-check=full --log-file=memcheck.log ./myapp

Categories: definitely lost (must fix), indirectly lost, possibly lost, still reachable (often optional). Reports include file and line.
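A minimal reproduction of the "definitely lost" category and its RAII fix (hypothetical function names; run the leaky version under Memcheck to see the report point at the new[] line):

```cpp
#include <memory>

// Leaks: the only pointer to the allocation goes out of scope on return,
// so Memcheck reports the block as "definitely lost".
int leaky_sum() {
    int* buf = new int[100]();   // never deleted
    buf[0] = 42;
    return buf[0];               // pointer lost here
}

// Fixed: unique_ptr owns the allocation and frees it automatically.
int fixed_sum() {
    auto buf = std::make_unique<int[]>(100);
    buf[0] = 42;
    return buf[0];
}
```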


5. Intel VTune: CPU pipeline analysis

Install VTune (Linux)

# Intel oneAPI includes VTune — download from Intel
# Ubuntu example: sudo apt install intel-oneapi-vtune
# source /opt/intel/oneapi/setvars.sh

VTune from the CLI

vtune -collect hotspots -result-dir vtune_result -- ./myapp
vtune -collect uarch-exploration -result-dir vtune_cache -- ./myapp
vtune -collect memory-access -result-dir vtune_mem -- ./myapp
vtune -report summary -result-dir vtune_result
vtune -report hotspots -result-dir vtune_result

Sample VTune-style summary

Hotspots by CPU Time:
  Function                    CPU Time    Module
  processData()               45.2%       myapp
  loadFile()                  28.1%       myapp
  parseJson()                 12.3%       myapp
Top Micro-architectural Issues:
  - L1 Data Cache Misses: 15.2%  ← improve access patterns
  - Branch Mispredictions: 3.1%
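A classic way to exercise the branch-misprediction counter is a data-dependent filter: the same function, fed sorted versus shuffled input, differs only in how predictable its hot branch is. A sketch for experimenting (illustrative, not taken from the report above):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Count elements >= threshold. On random input the `if` is taken
// unpredictably (mispredictions); on sorted input it becomes one long
// run of not-taken followed by one long run of taken.
int64_t count_ge(const std::vector<int>& v, int threshold) {
    int64_t n = 0;
    for (int x : v)
        if (x >= threshold)   // the hot, data-dependent branch
            ++n;
    return n;
}
```

Time count_ge on a large shuffled vector, then on a sorted copy, under VTune's uarch-exploration: the counts are identical, but the misprediction rate (and runtime) should drop on the sorted copy.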

6. Tracy: real-time profiler

What Tracy is

Tracy targets games and real-time apps: insert zones in code, connect the Tracy UI while running, and inspect per-frame latency on a timeline.

Tracy with CMake

include(FetchContent)
FetchContent_Declare(
    tracy
    GIT_REPOSITORY https://github.com/wolfpld/tracy.git
    GIT_TAG v0.10
)
FetchContent_MakeAvailable(tracy)
add_executable(myapp main.cpp)
target_link_libraries(myapp PRIVATE Tracy::TracyClient)
target_compile_definitions(myapp PRIVATE TRACY_ENABLE=1)

Zones and frames

#include <tracy/Tracy.hpp>
void processData(std::vector<int>& data) {
    ZoneScoped;
    for (size_t i = 0; i < data.size(); ++i) {
        ZoneScopedN("ProcessItem");  // per-item zone: fine for a demo, too heavy for very hot loops
        data[i] = data[i] * 2 + 1;
    }
}
void loadFile(const std::string& path) {
    ZoneScopedN("LoadFile");
}
int main() {
    while (running) {
        FrameMark;
        { ZoneScopedN("Update"); update(); }
        { ZoneScopedN("Physics"); physicsStep(); }
        { ZoneScopedN("Render"); render(); }
    }
}

Note: call FrameMark every frame so the UI can separate frames.

Running Tracy

# Download Tracy profiler from GitHub releases
# Run app built with TRACY_ENABLE=1, open profiler, click Connect
# Default: 127.0.0.1:8086

7. Full benchmark example

Target C++ program

// profile_target.cpp — sample workload for perf, VTune, Tracy
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iostream>
#ifdef TRACY_ENABLE
#include <tracy/Tracy.hpp>
#endif
// Intentionally cache-unfriendly stride access
void processDataCacheUnfriendly(std::vector<int>& data) {
#ifdef TRACY_ENABLE
    ZoneScopedN("ProcessCacheUnfriendly");
#endif
    const size_t stride = 16;
    for (size_t i = 0; i < data.size(); i += stride) {
        data[i] = data[i] * 2 + 1;
    }
}
void processDataCacheFriendly(std::vector<int>& data) {
#ifdef TRACY_ENABLE
    ZoneScopedN("ProcessCacheFriendly");
#endif
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2 + 1;
    }
}
void sortData(std::vector<int>& data) {
#ifdef TRACY_ENABLE
    ZoneScopedN("SortData");
#endif
    std::sort(data.begin(), data.end());
}
void fillRandom(std::vector<int>& data) {
#ifdef TRACY_ENABLE
    ZoneScopedN("FillRandom");
#endif
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000000);
    for (auto& v : data) {
        v = dis(gen);
    }
}
int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);
    {
#ifdef TRACY_ENABLE
        ZoneScopedN("FillRandom");
#endif
        fillRandom(data);
    }
    {
#ifdef TRACY_ENABLE
        ZoneScopedN("SortData");
#endif
        sortData(data);
    }
    {
#ifdef TRACY_ENABLE
        ZoneScopedN("ProcessCacheUnfriendly");
#endif
        processDataCacheUnfriendly(data);
    }
    {
#ifdef TRACY_ENABLE
        ZoneScopedN("ProcessCacheFriendly");
#endif
        processDataCacheFriendly(data);
    }
#ifdef TRACY_ENABLE
    FrameMark;
#endif
    return 0;
}

Build and run

g++ -std=c++17 -O2 -g -o profile_target profile_target.cpp
g++ -std=c++17 -O2 -pg -g -o profile_target_gprof profile_target.cpp
./profile_target_gprof
gprof profile_target_gprof gmon.out
g++ -std=c++17 -O0 -g -o profile_target_valgrind profile_target.cpp
valgrind --tool=callgrind ./profile_target_valgrind
valgrind --tool=cachegrind ./profile_target_valgrind
valgrind --tool=memcheck --leak-check=full ./profile_target_valgrind
cmake -B build -DCMAKE_BUILD_TYPE=Release -DTRACY_ENABLE=ON   # -D options go to the configure step
cmake --build build
perf record -F 99 -g ./profile_target
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

8. How to read flame graphs

Structure

flowchart TB
    subgraph Flame["Flame graph (width = CPU share)"]
        direction TB
        M[main - 100%]
        M --> F[fillRandom - 35%]
        M --> S[sortData - 45%]
        M --> P[processData - 20%]
        S --> S1[std::sort - 40%]
        S --> S2[comparator - 5%]
        P --> P1[loop - 18%]
        P --> P2[other - 2%]
    end

How to read:

  • Width: fraction of sampled CPU time — wider means hotter.
  • Vertical stack: caller below, callee above (main → sortData → std::sort).
  • Wide bars: optimize these first.

Common patterns

| Pattern | Meaning | Mitigation |
|---|---|---|
| Wide memcpy | Copy-bound | Pools, zero-copy |
| Wide malloc/free | Allocation cost | Arenas, pools |
| Wide std::sort | Sort cost | Avoid full sort, partial sort |
| Wide pthread_mutex_lock | Lock wait | Less locking, lock-free where safe |
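When malloc/free bars dominate, the usual first mitigation is a bump arena: one upfront allocation, a pointer bump per request, and everything freed at once. A minimal single-threaded sketch (fixed capacity, no per-object free; an illustration, not a production allocator):

```cpp
#include <cstddef>
#include <vector>

// Bump-pointer arena: allocate() is an aligned pointer increment,
// reset() releases everything in O(1).
class Arena {
public:
    explicit Arena(size_t bytes) : buf_(bytes), used_(0) {}

    void* allocate(size_t n, size_t align = alignof(std::max_align_t)) {
        size_t p = (used_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + n > buf_.size()) return nullptr;         // out of space
        used_ = p + n;
        return buf_.data() + p;
    }

    void reset() { used_ = 0; }  // "free" all objects at once

private:
    std::vector<unsigned char> buf_;
    size_t used_;
};
```

The trade-off: no individual deallocation and no destructor calls, so this fits transient, trivially-destructible per-frame or per-request data — exactly the case where wide malloc/free bars tend to appear.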

Full flame graph command sequence

git clone --depth 1 https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$(pwd)/FlameGraph"
perf record -F 99 -g --call-graph dwarf,8192 ./profile_target
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
open flamegraph.svg

9. Common issues and fixes

Issue 1: ??? symbols in perf report

Cause: Missing debug symbols or failed stack unwinding.

# Rebuild with debug info (-g) and unwind stacks with DWARF
g++ -std=c++17 -O2 -g -o myapp main.cpp
perf record -F 99 --call-graph dwarf,8192 ./myapp
perf report -v
# Optionally keep a hot function out of inlining so it shows up under its own name:
# __attribute__((noinline)) void criticalPath() { ... }

Issue 2: perf “Permission denied”

# Check the current restriction level (2 or higher blocks most profiling)
cat /proc/sys/kernel/perf_event_paranoid
# Temporarily allow full profiling
sudo sysctl -w kernel.perf_event_paranoid=-1
# Persist across reboots
echo "kernel.perf_event_paranoid = -1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Issue 3: VTune “Unable to attach”

# Relax ptrace restrictions so VTune can attach to running processes
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
# Load the sampling driver if installed (driver name can vary by VTune version)
sudo modprobe sep
# Make sure the oneAPI environment is sourced in this shell
source /opt/intel/oneapi/setvars.sh

Issue 4: Tracy will not connect

// 1. Confirm the binary was actually built with TRACY_ENABLE defined
#ifdef TRACY_ENABLE
// Tracy client is compiled in
#endif
# 2. Check port 8086 (Tracy default) is listening and not firewalled
netstat -an | grep 8086
sudo ufw allow 8086

Issue 5: perf stat “Events not found”

# List the events your CPU and kernel actually expose
perf list
# Then request only supported events
perf stat -e cycles,instructions,cache-misses ./myapp

Issue 6: Program runs 10× slower while profiling

Use shorter Valgrind runs; lower perf frequency: perf record -F 49 -g ./myapp.

Issue 7: No gmon.out from gprof

Ensure -pg on compile and link; exit cleanly (return 0 / exit(0)).

Issue 8: Memcheck “Invalid read/write”

Initialize memory — use int buffer[100]{} or std::vector<int>(100, 0).

Issue 9: Only “still reachable” from Memcheck

See the table in section 4 — definitely lost is the urgent class.

Issue 10: Empty flame graph

perf record -F 99 -g --call-graph dwarf,8192 ./myapp
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg

10. Profiling benchmark comparison

Cache-friendly vs unfriendly (same example, N = 10M)

| Function | Time | Miss rate | IPC |
|---|---|---|---|
| processDataCacheFriendly | 12 ms | 2.1% | 2.8 |
| processDataCacheUnfriendly | 89 ms | 18.3% | 0.4 |

Same arithmetic, 7×+ difference from access pattern alone.

Tool overhead (illustrative)

| Tool | Config | Overhead | Slowdown factor |
|---|---|---|---|
| None (baseline) | — | 0% | 1.00× |
| gprof | -pg | 5–15% | ~1.05–1.15× |
| perf | -F 99 (default) | ~2% | ~1.02× |
| perf | -F 999 (high rate) | ~8% | ~1.08× |
| VTune | hotspots (default) | ~10% | ~1.10× |
| Tracy | zones (default) | ~0.5% | ~1.005× |
| Valgrind | callgrind | very high | ~10–50× |
| Valgrind | cachegrind | very high | ~5–20× |

Sampling math

samples ≈ runtime_seconds × Hz
Example: 10 s at 99 Hz → ~990 samples
If a function is ~50% of time → ~495 samples — often enough
Short runs (<1 s): consider 999 Hz
Long runs (>10 s): 99 Hz is often fine
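The arithmetic above as tiny helpers (the formulas are the article's own rule of thumb, not from any tool's documentation):

```cpp
#include <cstdint>

// Total samples expected from profiling `runtime_seconds` at `hz`.
int64_t expected_samples(double runtime_seconds, int hz) {
    return static_cast<int64_t>(runtime_seconds * hz);
}

// Samples landing in a function that accounts for `share` of runtime.
int64_t samples_for_share(double runtime_seconds, int hz, double share) {
    return static_cast<int64_t>(runtime_seconds * hz * share);
}
```

For the worked example: expected_samples(10.0, 99) is 990, and samples_for_share(10.0, 99, 0.5) is 495 — enough samples to trust the attribution.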

11. Profiling best practices

1. Measure, do not guess

❌ “This function feels slow” → optimize immediately
✅ Use perf/gprof to list top functions, then optimize real hotspots

2. Establish a baseline

perf stat -r 5 ./myapp
time ./myapp

3. Change one thing at a time

Multiple simultaneous edits make attribution impossible.

4. Right tool for the job

| Goal | Prefer | Avoid |
|---|---|---|
| CPU hotspots | perf + flame graphs | Valgrind for CPU |
| Leaks | Memcheck | perf for leaks |
| Cache behavior | Cachegrind / perf stat | gprof for cache |
| Frame latency | Tracy | perf alone |
| Legacy systems | gprof | — |

5. Separate profile and release builds

#ifdef TRACY_ENABLE
    ZoneScopedN("CriticalSection");
#endif

6. Enough samples

Short runs may need higher frequency or longer duration.

7. Control the environment

sudo cpupower frequency-set -g performance

12. Production profiling patterns

Pattern 1: perf sampling in production

flowchart LR
    A[Prod server] --> B[perf record -F 49]
    B --> C[Collect ~30s]
    C --> D[Save perf.data]
    D --> E[Copy to dev machine]
    E --> F[perf report / flame graph]
# System-wide sample for 30 s (-a; without it, perf would profile `sleep` itself) ...
perf record -a -F 49 -g -o /tmp/perf.data -- sleep 30 &
# ... or attach to a single process
perf record -F 49 -g -p $(pgrep myapp) -o /tmp/perf.data -- sleep 30
scp server:/tmp/perf.data .
perf report -i perf.data

Pattern 2: Scheduled profiling

#!/bin/bash
OUT_DIR="/var/log/profiles"
mkdir -p "$OUT_DIR"
DATE=$(date +%Y%m%d_%H%M%S)
PID=$(pgrep -f myapp | head -1)
if [ -n "$PID" ]; then
    perf record -F 49 -g -p "$PID" -o "$OUT_DIR/perf_$DATE.data" -- sleep 60
fi
# crontab entry (separate from the script above): run nightly at 03:00
0 3 * * * /opt/scripts/profile_production.sh

Pattern 3: Conditional Tracy

#ifdef TRACY_ENABLE
    #define PROFILE_SCOPE(name) ZoneScopedN(name)
    #define PROFILE_FRAME() FrameMark
#else
    #define PROFILE_SCOPE(name) ((void)0)
    #define PROFILE_FRAME() ((void)0)
#endif
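Call sites then look identical in every build; with TRACY_ENABLE undefined the macros compile away to ((void)0). A self-contained sketch that repeats the wrappers so it builds on its own (Tracy disabled here, so only the no-op branch is exercised):

```cpp
// Conditional Tracy wrappers, as in the pattern above.
#ifdef TRACY_ENABLE
    #include <tracy/Tracy.hpp>
    #define PROFILE_SCOPE(name) ZoneScopedN(name)
    #define PROFILE_FRAME() FrameMark
#else
    #define PROFILE_SCOPE(name) ((void)0)
    #define PROFILE_FRAME() ((void)0)
#endif

// Hypothetical game-loop function: the zone costs nothing when
// profiling is compiled out.
int update_world(int tick) {
    PROFILE_SCOPE("UpdateWorld");
    return tick + 1;
}
```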

Pattern 4: Baseline microbenchmark

#include <chrono>
#include <iostream>
int main() {
    const int iterations = 100;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        runWorkload();
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "Baseline: " << (ms / double(iterations)) << " ms/iter\n";
    return 0;
}

Pattern 5: Valgrind on short workloads / tests

valgrind --tool=memcheck --leak-check=full ./run_tests
# CI: --error-exitcode=1

Production checklist:

  • perf at 49–99 Hz (≈1–5% overhead)
  • Tracy enabled only in dev/staging (TRACY_ENABLE=1)
  • VTune on staging (higher overhead)
  • No Valgrind in production (massive slowdown)
  • Watch disk — perf.data can be hundreds of MB

13. Checklists

perf

  • Build with -g
  • perf record -F 99 -g or --call-graph dwarf
  • Check perf_event_paranoid
  • Generate flame graphs with FlameGraph scripts
  • Use perf stat for IPC and cache metrics

gprof

  • -pg -g compile and link
  • Clean exit for gmon.out
  • gprof -p flat, gprof -q graph
  • Know limits around inlining and .so

Valgrind

  • Debug symbols for line numbers
  • Callgrind + KCachegrind when needed
  • Cachegrind for miss rates
  • Memcheck with --leak-check=full
  • Keep runs short

VTune

  • Intel CPU environment
  • oneAPI / VTune installed, setvars.sh
  • hotspots → microarchitecture → memory-access as needed
  • ptrace_scope if attach fails

Tracy

  • CMake FetchContent for Tracy
  • ZoneScoped / ZoneScopedN / FrameMark
  • Build with TRACY_ENABLE=1 when profiling
  • Disable in production builds when appropriate

Workflow

  • perf for quick CPU picture
  • Flame graph for visualization
  • Memcheck if leaks suspected
  • Cachegrind or perf stat if cache suspected
  • VTune for deep microarchitectural analysis
  • Tracy for frame-level real-time view
  • Re-measure after each change

Summary

ItemRole
perfLinux standard, flame graphs, low overhead, production sampling
gprofFlat profile and call graph with -pg, legacy environments
ValgrindCallgrind (CPU), Cachegrind (cache), Memcheck (memory); very slow
VTuneDeep analysis on Intel CPUs
TracyReal-time frame profiling for games and interactive apps
Flame graphsWidth = share of time; wide = optimize first
Productionperf -F 49~99, Tracy off or staging-only, periodic sampling

Principles:

  1. Measure before optimizing.
  2. Start with perf; add VTune or Tracy when needed.
  3. Use flame graphs to pick the widest bars first.
  4. In production, prefer low-frequency sampling and no Valgrind.

FAQ

When do I use this in practice?

For the first step of performance work: bottlenecks, CPU/memory behavior, cache misses, and multithreaded contention. Follow the examples and selection guides above.

Which tool should I pick?

Linux server CPU: perf. Memory leaks: Memcheck. Cache simulation: Cachegrind. Deep Intel analysis: VTune. Games / real-time: Tracy. Legacy / simple: gprof.

Is production profiling OK?

perf is often acceptable at 1–5% overhead. Prefer VTune and Tracy in dev/staging. See production patterns above.

Where can I read more?


One-line summary: Use perf, gprof, Valgrind, VTune, and Tracy to find bottlenecks, visualize them with flame graphs, analyze memory and cache behavior, and sample safely in production.


  • C++ SIMD optimization (SSE/AVX2/NEON) [#51-2]
  • C++ cache optimization guide
  • C++ thread pool guide [#51-3]
  • C++ profiling basics
  • C++ benchmarking
  • Stack vs heap in C++
  • C++ memory leaks
  • C++ Valgrind guide