[2026] C++ Advanced Profiling Guide | perf, gprof, Valgrind, VTune, Tracy [#51-1]

Key takeaways

When your multithreaded C++ game server burns 60% CPU and you cannot find the bottleneck: master perf, gprof, Valgrind (Callgrind, Cachegrind, Memcheck), VTune, Tracy, flame graphs, and cache analysis with practical commands and benchmarks.

Introduction: “Our multithreaded server uses 60% CPU and we do not know where”


Problem scenarios

Situations you actually hit:
- Game server uses 5 of 8 cores at 100%, but you do not know which function is hot
- perf report shows ??? for symbols, so you cannot analyze
- People say “lots of cache misses” but you do not know how to measure
- You want per-frame latency in real time, but gprof cannot do that
- You suspect a memory leak but cannot trace where it comes from
- gprof’s call graph is said to be inaccurate
- Valgrind runs 30× slower and feels impractical

More scenarios: API server at 100% CPU with unknown handler; memory grows 2 GB → 8 GB over 24 hours (leak?); O(n) work gets slower than linear as n grows (suspect cache).

Beyond basic profiling (cpp-series-15-1):

  • Advanced perf: flame graphs, cache events, reading stacks
  • Intel VTune: CPU pipeline, memory bandwidth, thread synchronization
  • Tracy: real-time frame profiling tuned for games and interactive apps

After reading this article you will be able to:

  • Build flame graphs with perf and see bottlenecks visually
  • Use gprof for call graphs and flat profiles (limits included)
  • Use Valgrind (Callgrind, Cachegrind, Memcheck) for memory and cache work
  • Quantify cache misses and branch mispredictions with VTune
  • Monitor per-frame latency in real time with Tracy
  • Apply safer sampling patterns in production

Expected environment: C++17 or newer, Linux (perf), Intel CPU (VTune), CMake (Tracy)


Experience from real projects: this article is based on real bottlenecks and fixes from large C++ codebases, including pitfalls and debugging tips you rarely see in textbooks.

Table of contents

  1. Problem scenarios and tool choice
  2. Advanced perf: flame graphs and cache profiling
  3. gprof: call graph and flat profile
  4. Valgrind: Callgrind, Cachegrind, Memcheck
  5. Intel VTune: CPU pipeline analysis
  6. Tracy: real-time profiler
  7. Full benchmark example
  8. How to read flame graphs
  9. Common issues and fixes
  10. Profiling benchmark comparison
  11. Profiling best practices
  12. Production profiling patterns
  13. Checklists

1. Problem scenarios and tool choice

When to use which tool?

flowchart TD
    A[Performance issue] --> B{Type?}
    B -->|CPU bottleneck| C{Environment?}
    B -->|Memory leak / errors| D[Valgrind Memcheck]
    B -->|Cache efficiency| E[Valgrind Cachegrind]
    C -->|Linux server| F{Intel CPU?}
    C -->|Game / real-time app| G[Tracy]
    F -->|Yes| H{Deep dive?}
    F -->|No / AMD| I[perf]
    H -->|Yes: cache / pipeline| J[Intel VTune]
    H -->|No| I
    I --> K[Flame graph]
    G --> L[Real-time timeline]

Tool comparison

| Tool | Overhead | Production | Strengths | Weaknesses |
|---|---|---|---|---|
| perf | 1–5% | ✅ Often OK | Free, standard on Linux, flame graphs | Some events limited on AMD |
| gprof | 5–15% | △ Sometimes | Call graph, easy to enable | Inaccurate sampling, ignores inlining |
| Valgrind | 10–50× slower | ❌ No | Leaks, cache simulation | Very slow; short runs only |
| VTune | 5–15% | △ Staging | Deep cache/pipeline | Intel-only, commercial |
| Tracy | 0.1–1% | △ Optional | Real time, per frame | Requires code changes |

Profiling workflow

sequenceDiagram
    participant Dev as Developer
    participant Perf as perf
    participant Valgrind as Valgrind
    participant VTune as VTune
    participant Tracy as Tracy
    Dev->>Perf: 1. perf record (quick hotspot search)
    Perf->>Dev: Flame graph, top functions
    Dev->>Valgrind: 2. Memcheck (if leak suspected)
    Valgrind->>Dev: Leak sites, bad accesses
    Dev->>VTune: 3. VTune (if cache/pipeline suspected)
    VTune->>Dev: Cache miss, branch prediction reports
    Dev->>Tracy: 4. Tracy (real-time frame analysis)
    Tracy->>Dev: Per-frame latency timeline

2. Advanced perf: flame graphs and cache profiling

Advanced perf record options

# Sampling rate: 99 Hz — common convention for hotspot hunting (odd rate avoids lockstep with periodic timers)
perf record -F 99 -g ./myapp
# 999 Hz: finer sampling (more overhead)
perf record -F 999 -g ./myapp
# DWARF unwinding with an explicit stack dump size in bytes (tune if stacks come out truncated)
perf record -F 99 --call-graph dwarf,4096 ./myapp
# Event: cache misses
perf record -e cache-misses -F 99 -g ./myapp
# Only CPUs 0 and 1 (useful for multithreaded apps)
perf record -C 0,1 -F 99 -g ./myapp

Option notes:

  • -F 99: 99 samples per second → low overhead, usually enough for hotspots
  • -g: collect stacks (required for flame graphs)
  • --call-graph dwarf: unwind stacks with DWARF (more accurate)
  • -e cache-misses: generic cache-miss event (on most CPUs this maps to last-level cache misses)

Building a flame graph (end-to-end)

# 1. Collect perf data while the app runs
perf record -F 99 -g -- ./myapp
# 2. Install FlameGraph once
git clone https://github.com/brendangregg/FlameGraph
export PATH=$PATH:$(pwd)/FlameGraph
# 3. Build SVG flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
# 4. Open in a browser
open flamegraph.svg   # macOS
xdg-open flamegraph.svg  # Linux

perf stat: hardware counters

# Default stats
perf stat ./myapp
# Detailed cache counters
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./myapp
# Repeat runs for averages
perf stat -r 5 ./myapp

Reading the output:

 Performance counter stats for './myapp' (5 runs):
       1,234.56 msec task-clock                # CPU time
              42      context-switches        # Many → thread switching cost
               0      cpu-migrations
             128      page-faults
   3,456,789,012      cycles
   2,345,678,901      instructions            # 0.68 insn per cycle
     123,456,789      cache-references
      12,345,678      cache-misses            # 10.0% miss rate!

Key metrics:

  • IPC (instructions per cycle): instructions / cycles — above ~1.0 is generally good; below ~0.5 often means memory-bound.
  • Cache miss rate: cache-misses / cache-references — above ~10% suggests revisiting access patterns.
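As a sanity check, the two ratios can be computed directly from the counter values in the sample output above (a small illustrative helper; the thresholds quoted are rules of thumb, not hard limits):

```cpp
#include <cstdint>

// IPC = instructions / cycles; values well below ~1.0 often indicate a
// memory-bound workload on modern cores.
double ipc(uint64_t instructions, uint64_t cycles) {
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Miss rate = cache-misses / cache-references; above ~10% is worth a look.
double miss_rate(uint64_t misses, uint64_t references) {
    return static_cast<double>(misses) / static_cast<double>(references);
}
```

With the sample counters, ipc(2345678901, 3456789012) is about 0.68 and miss_rate(12345678, 123456789) is about 0.10 — matching the annotations in the output.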

perf annotate: line-level hotspots

perf annotate -s processData
# After perf record
perf report
# Press 'a' for annotate, 's' for symbol sort

3. gprof: call graph and flat profile

What gprof is

gprof is the classic GNU profiler (part of binutils, used alongside GCC). Compile with -pg, run the program to produce gmon.out, then inspect per-function CPU share (flat profile) and the call graph. It is useful on legacy systems without perf, but inlined functions and shared libraries can skew results.

Full gprof workflow

# 1. Compile with -pg (can combine with optimization)
g++ -std=c++17 -O2 -pg -g -o myapp profile_target.cpp
# 2. Run (writes gmon.out)
./myapp
# 3. Flat profile (time share per function)
gprof myapp gmon.out
# 4. Call graph only
gprof -q myapp gmon.out
# 5. Flat profile only (no graph)
gprof -p myapp gmon.out
# 6. Save report to a file
gprof myapp gmon.out > gprof_report.txt

Reading gprof output

Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 45.23      2.15     2.15        1  2150.00  2150.00  sortData
 28.10      3.48     1.33        1  1330.00  1330.00  fillRandom
 12.30      4.07     0.59        1   590.00   590.00  processDataCacheUnfriendly
 10.00      4.54     0.48        1   480.00   480.00  processDataCacheFriendly

Important columns:

  • % time: fraction of total time in that function
  • self seconds: time spent in the function body
  • calls: number of invocations
  • total: time including callees

Call graph example

index % time    self  children    called     name
[1]    100.0    0.00    4.54                 main [1]
                2.15    0.00       1/1           sortData [2]
                1.33    0.00       1/1           fillRandom [3]
[2]     47.4    2.15    0.00       1         sortData [2]

Limits and alternatives

| Limit | What happens | Alternative |
|---|---|---|
| Inlining ignored | With -O2, inlined cost rolls into parents | perf with DWARF stacks |
| Shared libraries | .so internals can be fuzzy | perf, VTune |
| Multithreading | Not split per thread | perf -C, VTune threading |
| Fixed sampling | Short functions may be missed | Tune perf -F |

4. Valgrind: Callgrind, Cachegrind, Memcheck

What Valgrind is

Valgrind uses dynamic binary instrumentation: it runs your program on a synthetic CPU to analyze memory, cache, and calls in detail. Expect 10–50× slowdown — use short runs or unit tests.

Valgrind tools compared

| Tool | Role | Output |
|---|---|---|
| Callgrind | CPU profiling, call counts | callgrind.out.*, visualize in KCachegrind |
| Cachegrind | L1 and LL (last-level) miss simulation | Cache statistics |
| Memcheck | Leaks, invalid access | Reports with file:line |

Callgrind: CPU profiling

valgrind --tool=callgrind ./myapp
# Output: callgrind.out.<pid>
# qcachegrind callgrind.out.12345
callgrind_annotate callgrind.out.12345
callgrind_annotate --inclusive=yes callgrind.out.12345 | head -80

Reading output: callgrind_annotate shows instructions retired (Ir) per function — the top entries are usual suspects.

Cachegrind: cache misses

valgrind --tool=cachegrind ./myapp
# Example lines:
# ==12345== D1  misses:      12,345,678  ( 10.2% of all refs)
# ==12345== LL misses:        1,234,567  (  1.0% of all refs)

Interpretation: high D1 misses (L1 data) or LL misses (last-level → DRAM) mean you should improve locality or layout.
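To make the locality point concrete, here is a sketch contrasting sequential (row-major) and strided (column-major) traversal of the same contiguous matrix; run each under Cachegrind and compare D1 miss rates (illustrative function names, not from the article's benchmark):

```cpp
#include <cstddef>
#include <vector>

// Matrix stored row-major in one contiguous vector of rows * cols ints.

// Sequential walk: consecutive addresses, cache lines fully used.
long long sum_row_major(const std::vector<int>& m, size_t rows, size_t cols) {
    long long s = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

// Strided walk: jumps `cols` ints per step, so each access may touch a
// new cache line — D1 misses climb as `cols` grows.
long long sum_col_major(const std::vector<int>& m, size_t rows, size_t cols) {
    long long s = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
```

Both functions return the same sum; only the access order — and therefore the simulated miss rate — differs.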

Memcheck: leaks and memory errors

valgrind --tool=memcheck --leak-check=full ./myapp
valgrind --tool=memcheck --leak-check=full --log-file=memcheck.log ./myapp

Categories: definitely lost (must fix), indirectly lost, possibly lost, still reachable (often optional). Reports include file and line.
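A minimal reproduction of the "definitely lost" category and its RAII fix (hypothetical function names; run the leaky version under Memcheck to see the report point at the new[] line):

```cpp
#include <memory>

// Leaks: the only pointer to the allocation goes out of scope on return,
// so Memcheck reports the block as "definitely lost".
int leaky_sum() {
    int* buf = new int[100]();   // never deleted
    buf[0] = 42;
    return buf[0];               // pointer lost here
}

// Fixed: unique_ptr owns the allocation and frees it automatically.
int fixed_sum() {
    auto buf = std::make_unique<int[]>(100);
    buf[0] = 42;
    return buf[0];
}
```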


5. Intel VTune: CPU pipeline analysis

Install VTune (Linux)

# Intel oneAPI includes VTune — download from Intel
# Ubuntu example: sudo apt install intel-oneapi-vtune
# source /opt/intel/oneapi/setvars.sh

VTune from the CLI

vtune -collect hotspots -result-dir vtune_result -- ./myapp
vtune -collect uarch-exploration -result-dir vtune_cache -- ./myapp
vtune -collect memory-access -result-dir vtune_mem -- ./myapp
vtune -report summary -result-dir vtune_result
vtune -report hotspots -result-dir vtune_result

Sample VTune-style summary

Hotspots by CPU Time:
  Function                    CPU Time    Module
  processData()               45.2%       myapp
  loadFile()                  28.1%       myapp
  parseJson()                 12.3%       myapp
Top Micro-architectural Issues:
  - L1 Data Cache Misses: 15.2%  ← improve access patterns
  - Branch Mispredictions: 3.1%
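A classic way to exercise the branch-misprediction counter is a data-dependent filter: the same function, fed sorted versus shuffled input, differs only in how predictable its hot branch is. A sketch for experimenting (illustrative, not taken from the report above):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Count elements >= threshold. On random input the `if` is taken
// unpredictably (mispredictions); on sorted input it becomes one long
// run of not-taken followed by one long run of taken.
int64_t count_ge(const std::vector<int>& v, int threshold) {
    int64_t n = 0;
    for (int x : v)
        if (x >= threshold)   // the hot, data-dependent branch
            ++n;
    return n;
}
```

Time count_ge on a large shuffled vector, then on a sorted copy, under VTune's uarch-exploration: the counts are identical, but the misprediction rate (and runtime) should drop on the sorted copy.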

6. Tracy: real-time profiler

What Tracy is

Tracy targets games and real-time apps: insert zones in code, connect the Tracy UI while running, and inspect per-frame latency on a timeline.

Tracy with CMake

include(FetchContent)
FetchContent_Declare(
    tracy
    GIT_REPOSITORY https://github.com/wolfpld/tracy.git
    GIT_TAG v0.10
)
FetchContent_MakeAvailable(tracy)
add_executable(myapp main.cpp)
target_link_libraries(myapp PRIVATE Tracy::TracyClient)
target_compile_definitions(myapp PRIVATE TRACY_ENABLE=1)

Zones and frames

#include <tracy/Tracy.hpp>
void processData(std::vector<int>& data) {
    ZoneScoped;
    for (size_t i = 0; i < data.size(); ++i) {
        ZoneScopedN("ProcessItem");  // per-item zone: fine for a demo, too heavy for very hot loops
        data[i] = data[i] * 2 + 1;
    }
}
void loadFile(const std::string& path) {
    ZoneScopedN("LoadFile");
}
int main() {
    while (running) {
        FrameMark;
        { ZoneScopedN("Update"); update(); }
        { ZoneScopedN("Physics"); physicsStep(); }
        { ZoneScopedN("Render"); render(); }
    }
}

Note: call FrameMark every frame so the UI can separate frames.

Running Tracy

# Download Tracy profiler from GitHub releases
# Run app built with TRACY_ENABLE=1, open profiler, click Connect
# Default: 127.0.0.1:8086

7. Full benchmark example

Target C++ program

// profile_target.cpp — sample workload for perf, VTune, Tracy
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iostream>
#ifdef TRACY_ENABLE
#include <tracy/Tracy.hpp>
#endif
// Intentionally cache-unfriendly stride access
void processDataCacheUnfriendly(std::vector<int>& data) {
#ifdef TRACY_ENABLE
    ZoneScopedN("ProcessCacheUnfriendly");
#endif
    const size_t stride = 16;
    for (size_t i = 0; i < data.size(); i += stride) {
        data[i] = data[i] * 2 + 1;
    }
}
void processDataCacheFriendly(std::vector<int>& data) {
#ifdef TRACY_ENABLE
    ZoneScopedN("ProcessCacheFriendly");
#endif
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = data[i] * 2 + 1;
    }
}
void sortData(std::vector<int>& data) {
#ifdef TRACY_ENABLE
    ZoneScopedN("SortData");
#endif
    std::sort(data.begin(), data.end());
}
void fillRandom(std::vector<int>& data) {
#ifdef TRACY_ENABLE
    ZoneScopedN("FillRandom");
#endif
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(1, 1000000);
    for (auto& v : data) {
        v = dis(gen);
    }
}
int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);
    {
#ifdef TRACY_ENABLE
        ZoneScopedN("FillRandom");
#endif
        fillRandom(data);
    }
    {
#ifdef TRACY_ENABLE
        ZoneScopedN("SortData");
#endif
        sortData(data);
    }
    {
#ifdef TRACY_ENABLE
        ZoneScopedN("ProcessCacheUnfriendly");
#endif
        processDataCacheUnfriendly(data);
    }
    {
#ifdef TRACY_ENABLE
        ZoneScopedN("ProcessCacheFriendly");
#endif
        processDataCacheFriendly(data);
    }
#ifdef TRACY_ENABLE
    FrameMark;
#endif
    return 0;
}

Build and run

g++ -std=c++17 -O2 -g -o profile_target profile_target.cpp
g++ -std=c++17 -O2 -pg -g -o profile_target_gprof profile_target.cpp
./profile_target_gprof
gprof profile_target_gprof gmon.out
g++ -std=c++17 -O0 -g -o profile_target_valgrind profile_target.cpp
valgrind --tool=callgrind ./profile_target_valgrind
valgrind --tool=cachegrind ./profile_target_valgrind
valgrind --tool=memcheck --leak-check=full ./profile_target_valgrind
cmake -B build -DCMAKE_BUILD_TYPE=Release -DTRACY_ENABLE=ON   # -D options go to the configure step
cmake --build build
perf record -F 99 -g ./profile_target
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

8. How to read flame graphs

Structure

flowchart TB
    subgraph Flame["Flame graph (width = CPU share)"]
        direction TB
        M[main - 100%]
        M --> F[fillRandom - 35%]
        M --> S[sortData - 45%]
        M --> P[processData - 20%]
        S --> S1[std::sort - 40%]
        S --> S2[comparator - 5%]
        P --> P1[loop - 18%]
        P --> P2[other - 2%]
    end

How to read:

  • Width: fraction of sampled CPU time — wider means hotter.
  • Vertical stack: caller below, callee above (main → sortData → std::sort).
  • Wide bars: optimize these first.

Common patterns

| Pattern | Meaning | Mitigation |
|---|---|---|
| Wide memcpy | Copy-bound | Pools, zero-copy |
| Wide malloc/free | Allocation cost | Arenas, pools |
| Wide std::sort | Sort cost | Avoid full sort, partial sort |
| Wide pthread_mutex_lock | Lock wait | Less locking, lock-free where safe |
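When malloc/free bars dominate, the usual first mitigation is a bump arena: one upfront allocation, a pointer bump per request, and everything freed at once. A minimal single-threaded sketch (fixed capacity, no per-object free; an illustration, not a production allocator):

```cpp
#include <cstddef>
#include <vector>

// Bump-pointer arena: allocate() is an aligned pointer increment,
// reset() releases everything in O(1).
class Arena {
public:
    explicit Arena(size_t bytes) : buf_(bytes), used_(0) {}

    void* allocate(size_t n, size_t align = alignof(std::max_align_t)) {
        size_t p = (used_ + align - 1) & ~(align - 1);  // round up to alignment
        if (p + n > buf_.size()) return nullptr;         // out of space
        used_ = p + n;
        return buf_.data() + p;
    }

    void reset() { used_ = 0; }  // "free" all objects at once

private:
    std::vector<unsigned char> buf_;
    size_t used_;
};
```

The trade-off: no individual deallocation and no destructor calls, so this fits transient, trivially-destructible per-frame or per-request data — exactly the case where wide malloc/free bars tend to appear.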

Full flame graph command sequence

git clone --depth 1 https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$(pwd)/FlameGraph"
perf record -F 99 -g --call-graph dwarf,8192 ./profile_target
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
open flamegraph.svg

9. Common issues and fixes

Issue 1: ??? symbols in perf report

Cause: Missing debug symbols or failed stack unwinding.

# Rebuild with debug info (-g) and unwind stacks with DWARF
g++ -std=c++17 -O2 -g -o myapp main.cpp
perf record -F 99 --call-graph dwarf,8192 ./myapp
perf report -v
# Optionally keep a hot function out of inlining so it shows up under its own name:
# __attribute__((noinline)) void criticalPath() { ... }

Issue 2: perf “Permission denied”

# Check the current restriction level (2 or higher blocks most profiling)
cat /proc/sys/kernel/perf_event_paranoid
# Temporarily allow full profiling
sudo sysctl -w kernel.perf_event_paranoid=-1
# Persist across reboots
echo "kernel.perf_event_paranoid = -1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

Issue 3: VTune “Unable to attach”

# Relax ptrace restrictions so VTune can attach to running processes
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
# Load the sampling driver if installed (driver name can vary by VTune version)
sudo modprobe sep
# Make sure the oneAPI environment is sourced in this shell
source /opt/intel/oneapi/setvars.sh

Issue 4: Tracy will not connect

// 1. Confirm the binary was actually built with TRACY_ENABLE defined
#ifdef TRACY_ENABLE
// Tracy client is compiled in
#endif
# 2. Check port 8086 (Tracy default) is listening and not firewalled
netstat -an | grep 8086
sudo ufw allow 8086

Issue 5: perf stat “Events not found”

# List the events your CPU and kernel actually expose
perf list
# Then request only supported events
perf stat -e cycles,instructions,cache-misses ./myapp

Issue 6: Program runs 10× slower while profiling

Use shorter Valgrind runs; lower perf frequency: perf record -F 49 -g ./myapp.

Issue 7: No gmon.out from gprof

Ensure -pg on compile and link; exit cleanly (return 0 / exit(0)).

Issue 8: Memcheck “Invalid read/write”

Initialize memory — use int buffer[100]{} or std::vector<int>(100, 0).

Issue 9: Only “still reachable” from Memcheck

See the table in section 4 — definitely lost is the urgent class.

Issue 10: Empty flame graph

perf record -F 99 -g --call-graph dwarf,8192 ./myapp
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg

10. Profiling benchmark comparison

Cache-friendly vs unfriendly (same example, N = 10M)

| Function | Time | Miss rate | IPC |
|---|---|---|---|
| processDataCacheFriendly | 12 ms | 2.1% | 2.8 |
| processDataCacheUnfriendly | 89 ms | 18.3% | 0.4 |

Same arithmetic, 7×+ difference from access pattern alone.

Tool overhead (illustrative)

| Tool | Config | Overhead | Slowdown factor |
|---|---|---|---|
| None (baseline) | — | 0% | 1.00× |
| gprof | -pg | 5–15% | ~1.05–1.15× |
| perf | -F 99 (default) | ~2% | ~1.02× |
| perf | -F 999 (high rate) | ~8% | ~1.08× |
| VTune | hotspots (default) | ~10% | ~1.10× |
| Tracy | zones (default) | ~0.5% | ~1.005× |
| Valgrind | callgrind | very high | ~10–50× |
| Valgrind | cachegrind | very high | ~5–20× |

Sampling math

samples ≈ runtime_seconds × Hz
Example: 10 s at 99 Hz → ~990 samples
If a function is ~50% of time → ~495 samples — often enough
Short runs (<1 s): consider 999 Hz
Long runs (>10 s): 99 Hz is often fine
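The arithmetic above as tiny helpers (the formulas are the article's own rule of thumb, not from any tool's documentation):

```cpp
#include <cstdint>

// Total samples expected from profiling `runtime_seconds` at `hz`.
int64_t expected_samples(double runtime_seconds, int hz) {
    return static_cast<int64_t>(runtime_seconds * hz);
}

// Samples landing in a function that accounts for `share` of runtime.
int64_t samples_for_share(double runtime_seconds, int hz, double share) {
    return static_cast<int64_t>(runtime_seconds * hz * share);
}
```

For the worked example: expected_samples(10.0, 99) is 990, and samples_for_share(10.0, 99, 0.5) is 495 — enough samples to trust the attribution.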

11. Profiling best practices

1. Measure, do not guess

❌ “This function feels slow” → optimize immediately
✅ Use perf/gprof to list top functions, then optimize real hotspots

2. Establish a baseline

perf stat -r 5 ./myapp
time ./myapp

3. Change one thing at a time

Multiple simultaneous edits make attribution impossible.

4. Right tool for the job

| Goal | Prefer | Avoid |
|---|---|---|
| CPU hotspots | perf + flame graphs | Valgrind for CPU |
| Leaks | Memcheck | perf for leaks |
| Cache behavior | Cachegrind / perf stat | gprof for cache |
| Frame latency | Tracy | perf alone |
| Legacy systems | gprof | — |

5. Separate profile and release builds

#ifdef TRACY_ENABLE
    ZoneScopedN("CriticalSection");
#endif

6. Enough samples

Short runs may need higher frequency or longer duration.

7. Control the environment

sudo cpupower frequency-set -g performance

12. Production profiling patterns

Pattern 1: perf sampling in production

flowchart LR
    A[Prod server] --> B[perf record -F 49]
    B --> C[Collect ~30s]
    C --> D[Save perf.data]
    D --> E[Copy to dev machine]
    E --> F[perf report / flame graph]
# System-wide sample for 30 s (-a; without it, perf would profile `sleep` itself) ...
perf record -a -F 49 -g -o /tmp/perf.data -- sleep 30 &
# ... or attach to a single process
perf record -F 49 -g -p $(pgrep myapp) -o /tmp/perf.data -- sleep 30
scp server:/tmp/perf.data .
perf report -i perf.data

Pattern 2: Scheduled profiling

#!/bin/bash
OUT_DIR="/var/log/profiles"
mkdir -p "$OUT_DIR"
DATE=$(date +%Y%m%d_%H%M%S)
PID=$(pgrep -f myapp | head -1)
if [ -n "$PID" ]; then
    perf record -F 49 -g -p "$PID" -o "$OUT_DIR/perf_$DATE.data" -- sleep 60
fi
# crontab entry (separate from the script above): run nightly at 03:00
0 3 * * * /opt/scripts/profile_production.sh

Pattern 3: Conditional Tracy

#ifdef TRACY_ENABLE
    #define PROFILE_SCOPE(name) ZoneScopedN(name)
    #define PROFILE_FRAME() FrameMark
#else
    #define PROFILE_SCOPE(name) ((void)0)
    #define PROFILE_FRAME() ((void)0)
#endif
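Call sites then look identical in every build; with TRACY_ENABLE undefined the macros compile away to ((void)0). A self-contained sketch that repeats the wrappers so it builds on its own (Tracy disabled here, so only the no-op branch is exercised):

```cpp
// Conditional Tracy wrappers, as in the pattern above.
#ifdef TRACY_ENABLE
    #include <tracy/Tracy.hpp>
    #define PROFILE_SCOPE(name) ZoneScopedN(name)
    #define PROFILE_FRAME() FrameMark
#else
    #define PROFILE_SCOPE(name) ((void)0)
    #define PROFILE_FRAME() ((void)0)
#endif

// Hypothetical game-loop function: the zone costs nothing when
// profiling is compiled out.
int update_world(int tick) {
    PROFILE_SCOPE("UpdateWorld");
    return tick + 1;
}
```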

Pattern 4: Baseline microbenchmark

#include <chrono>
#include <iostream>
int main() {
    const int iterations = 100;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        runWorkload();
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "Baseline: " << (ms / double(iterations)) << " ms/iter\n";
    return 0;
}

Pattern 5: Valgrind on short workloads / tests

valgrind --tool=memcheck --leak-check=full ./run_tests
# CI: --error-exitcode=1

Production checklist:

  • perf at 49–99 Hz (≈1–5% overhead)
  • Tracy enabled only in dev/staging (TRACY_ENABLE=1)
  • VTune on staging (higher overhead)
  • No Valgrind in production (massive slowdown)
  • Watch disk — perf.data can be hundreds of MB

13. Checklists

perf

  • Build with -g
  • perf record -F 99 -g or --call-graph dwarf
  • Check perf_event_paranoid
  • Generate flame graphs with FlameGraph scripts
  • Use perf stat for IPC and cache metrics

gprof

  • -pg -g compile and link
  • Clean exit for gmon.out
  • gprof -p flat, gprof -q graph
  • Know limits around inlining and .so

Valgrind

  • Debug symbols for line numbers
  • Callgrind + KCachegrind when needed
  • Cachegrind for miss rates
  • Memcheck with --leak-check=full
  • Keep runs short

VTune

  • Intel CPU environment
  • oneAPI / VTune installed, setvars.sh
  • hotspots → microarchitecture → memory-access as needed
  • ptrace_scope if attach fails

Tracy

  • CMake FetchContent for Tracy
  • ZoneScoped / ZoneScopedN / FrameMark
  • Build with TRACY_ENABLE=1 when profiling
  • Disable in production builds when appropriate

Workflow

  • perf for quick CPU picture
  • Flame graph for visualization
  • Memcheck if leaks suspected
  • Cachegrind or perf stat if cache suspected
  • VTune for deep microarchitectural analysis
  • Tracy for frame-level real-time view
  • Re-measure after each change

Summary

ItemRole
perfLinux standard, flame graphs, low overhead, production sampling
gprofFlat profile and call graph with -pg, legacy environments
ValgrindCallgrind (CPU), Cachegrind (cache), Memcheck (memory); very slow
VTuneDeep analysis on Intel CPUs
TracyReal-time frame profiling for games and interactive apps
Flame graphsWidth = share of time; wide = optimize first
Productionperf -F 49~99, Tracy off or staging-only, periodic sampling

Principles:

  1. Measure before optimizing.
  2. Start with perf; add VTune or Tracy when needed.
  3. Use flame graphs to pick the widest bars first.
  4. In production, prefer low-frequency sampling and no Valgrind.

FAQ

When do I use this in practice?

For the first step of performance work: bottlenecks, CPU/memory behavior, cache misses, and multithreaded contention. Follow the examples and selection guides above.

Which tool should I pick?

Linux server CPU: perf. Memory leaks: Memcheck. Cache simulation: Cachegrind. Deep Intel analysis: VTune. Games / real-time: Tracy. Legacy / simple: gprof.

Is production profiling OK?

perf is often acceptable at 1–5% overhead. Prefer VTune and Tracy in dev/staging. See production patterns above.

Where can I read more?


One-line summary: Use perf, gprof, Valgrind, VTune, and Tracy to find bottlenecks, visualize them with flame graphs, analyze memory and cache behavior, and sample safely in production.


  • C++ SIMD optimization (SSE/AVX2/NEON) [#51-2]
  • C++ cache optimization guide
  • C++ thread pool guide [#51-3]
  • C++ profiling basics
  • C++ benchmarking
  • Stack vs heap in C++
  • C++ memory leaks
  • C++ Valgrind guide