[2026] C++ Advanced Profiling Guide | perf, gprof, Valgrind, VTune, Tracy [#51-1]
Key takeaways
When your multithreaded C++ game server burns 60% CPU and you cannot find the bottleneck: master perf, gprof, Valgrind (Callgrind, Cachegrind, Memcheck), VTune, Tracy, flame graphs, and cache analysis with practical commands and benchmarks.
Introduction: “Our multithreaded server uses 60% CPU and we do not know where”
Problem scenarios
Situations you actually hit:
- Game server uses 5 of 8 cores at 100%, but you do not know which function is hot
- perf report shows ??? for symbols, so you cannot analyze
- People say “lots of cache misses” but you do not know how to measure
- You want per-frame latency in real time, but gprof cannot do that
- You suspect a memory leak but cannot trace where it comes from
- gprof’s call graph is said to be inaccurate
- Valgrind runs 30× slower and feels impractical
More scenarios: API server at 100% CPU with unknown handler; memory grows 2 GB → 8 GB over 24 hours (leak?); O(n) work gets slower than linear as n grows (suspect cache).
Beyond basic profiling (cpp-series-15-1):
- Advanced perf: flame graphs, cache events, reading stacks
- Intel VTune: CPU pipeline, memory bandwidth, thread synchronization
- Tracy: real-time frame profiling tuned for games and interactive apps
After reading this article you will be able to:
- Build flame graphs with perf and see bottlenecks visually
- Use gprof for call graphs and flat profiles (limits included)
- Use Valgrind (Callgrind, Cachegrind, Memcheck) for memory and cache work
- Quantify cache misses and branch mispredictions with VTune
- Monitor per-frame latency in real time with Tracy
- Apply safer sampling patterns in production
Expected environment: C++17 or newer, Linux (perf), Intel CPU (VTune), CMake (Tracy)
Experience from real projects: this article is based on real bottlenecks and fixes from large C++ codebases, including pitfalls and debugging tips you rarely see in textbooks.
Table of contents
- Problem scenarios and tool choice
- Advanced perf: flame graphs and cache profiling
- gprof: call graph and flat profile
- Valgrind: Callgrind, Cachegrind, Memcheck
- Intel VTune: CPU pipeline analysis
- Tracy: real-time profiler
- Full benchmark example
- How to read flame graphs
- Common issues and fixes
- Profiling benchmark comparison
- Profiling best practices
- Production profiling patterns
- Checklists
1. Problem scenarios and tool choice
When to use which tool?
flowchart TD
A[Performance issue] --> B{Type?}
B -->|CPU bottleneck| C{Environment?}
B -->|Memory leak / errors| D[Valgrind Memcheck]
B -->|Cache efficiency| E[Valgrind Cachegrind]
C -->|Linux server| F{Intel CPU?}
C -->|Game / real-time app| G[Tracy]
F -->|Yes| H{Deep dive?}
F -->|No / AMD| I[perf]
H -->|Yes: cache / pipeline| J[Intel VTune]
H -->|No| I
I --> K[Flame graph]
G --> L[Real-time timeline]
Tool comparison
| Tool | Overhead | Production | Strengths | Weaknesses |
|---|---|---|---|---|
| perf | 1–5% | ✅ Often OK | Free, standard on Linux, flame graphs | Some events limited on AMD |
| gprof | 5–15% | △ Sometimes | Call graph, easy to enable | Inaccurate sampling, ignores inlining |
| Valgrind | 10–50× slower | ❌ No | Leaks, cache simulation | Very slow; short runs only |
| VTune | 5–15% | △ Staging | Deep cache/pipeline | Intel-only, commercial |
| Tracy | 0.1–1% | △ Optional | Real time, per frame | Requires code changes |
Profiling workflow
sequenceDiagram
participant Dev as Developer
participant Perf as perf
participant Valgrind as Valgrind
participant VTune as VTune
participant Tracy as Tracy
Dev->>Perf: 1. perf record (quick hotspot search)
Perf->>Dev: Flame graph, top functions
Dev->>Valgrind: 2. Memcheck (if leak suspected)
Valgrind->>Dev: Leak sites, bad accesses
Dev->>VTune: 3. VTune (if cache/pipeline suspected)
VTune->>Dev: Cache miss, branch prediction reports
Dev->>Tracy: 4. Tracy (real-time frame analysis)
Tracy->>Dev: Per-frame latency timeline
2. Advanced perf: flame graphs and cache profiling
Advanced perf record options
# Sampling at 99 Hz: a common low-overhead rate for hotspot hunting (99 avoids lockstep with 100 Hz timers)
perf record -F 99 -g ./myapp
# 999 Hz: finer sampling (more overhead)
perf record -F 999 -g ./myapp
# DWARF stack unwinding with a 4096-byte stack dump per sample (raise it if stacks look truncated)
perf record -F 99 --call-graph dwarf,4096 ./myapp
# Event: cache misses
perf record -e cache-misses -F 99 -g ./myapp
# Only CPUs 0 and 1 (useful for multithreaded apps)
perf record -C 0,1 -F 99 -g ./myapp
Option notes:
- -F 99: 99 samples per second, low overhead, usually enough for hotspots
- -g: collect stacks (required for flame graphs)
- --call-graph dwarf: unwind stacks with DWARF (more accurate)
- -e cache-misses: cache-miss events (typically last-level misses)
Building a flame graph (end-to-end)
# 1. Collect perf data while the app runs
perf record -F 99 -g -- ./myapp
# 2. Install FlameGraph once
git clone https://github.com/brendangregg/FlameGraph
export PATH=$PATH:$(pwd)/FlameGraph
# 3. Build SVG flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
# 4. Open in a browser
open flamegraph.svg # macOS
xdg-open flamegraph.svg # Linux
perf stat: hardware counters
# Default stats
perf stat ./myapp
# Detailed cache counters
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./myapp
# Repeat runs for averages
perf stat -r 5 ./myapp
Reading the output:
Performance counter stats for './myapp' (5 runs):
1,234.56 msec task-clock # CPU time
42 context-switches # Many → thread switching cost
0 cpu-migrations
128 page-faults
3,456,789,012 cycles
2,345,678,901 instructions # 0.68 insn per cycle
123,456,789 cache-references
12,345,678 cache-misses # 10.0% miss rate!
Key metrics:
- IPC (instructions per cycle): instructions / cycles. Above ~1.0 is generally good; below ~0.5 often means memory-bound.
- Cache miss rate: cache-misses / cache-references. Above ~10% suggests revisiting access patterns.
perf annotate: line-level hotspots
perf annotate -s processData
# Or, after perf record:
perf report
# In the TUI, select a symbol and press 'a' to annotate it line by line
3. gprof: call graph and flat profile
What gprof is
gprof is a sampling profiler that ships with binutils alongside GCC. Compile with -pg, run to produce gmon.out, then inspect per-function CPU share and the call graph. It is useful on legacy systems without perf, but inlined functions and shared libraries can skew results.
Full gprof workflow
# 1. Compile with -pg (can combine with optimization)
g++ -std=c++17 -O2 -pg -g -o myapp profile_target.cpp
# 2. Run (writes gmon.out)
./myapp
# 3. Flat profile (time share per function)
gprof myapp gmon.out
# 4. Call graph only
gprof -q myapp gmon.out
# 5. Flat profile only (no graph)
gprof -p myapp gmon.out
# 6. Save report to a file
gprof myapp gmon.out > gprof_report.txt
Reading gprof output
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
 47.36      2.15     2.15        1  2150.00  2150.00  sortData
 29.30      3.48     1.33        1  1330.00  1330.00  fillRandom
 13.00      4.07     0.59        1   590.00   590.00  processDataCacheUnfriendly
 10.57      4.54     0.48        1   480.00   480.00  processDataCacheFriendly
Important columns:
- % time: fraction of total time in that function
- self seconds: time spent in the function body
- calls: number of invocations
- total: time including callees
Call graph example
index % time self children called name
[1] 100.0 0.00 4.54 main [1]
2.15 0.00 1/1 sortData [2]
1.33 0.00 1/1 fillRandom [3]
[2] 47.4 2.15 0.00 1 sortData [2]
Limits and alternatives
| Limit | What happens | Alternative |
|---|---|---|
| Inlining ignored | With -O2, inlined cost rolls into parents | perf with DWARF stacks |
| Shared libraries | .so internals can be fuzzy | perf, VTune |
| Multithreading | Not split per thread | perf -C, VTune threading |
| Fixed sampling | Short functions may be missed | Tune perf -F |
4. Valgrind: Callgrind, Cachegrind, Memcheck
What Valgrind is
Valgrind uses dynamic binary instrumentation: it runs your program on a synthetic CPU to analyze memory, cache, and calls in detail. Expect 10–50× slowdown — use short runs or unit tests.
Valgrind tools compared
| Tool | Role | Output |
|---|---|---|
| Callgrind | CPU profiling, call counts | callgrind.out.*, visualize in KCachegrind |
| Cachegrind | L1/L2/L3 miss simulation | Cache statistics |
| Memcheck | Leaks, invalid access | Reports with file:line |
Callgrind: CPU profiling
valgrind --tool=callgrind ./myapp
# Output: callgrind.out.<pid>
# qcachegrind callgrind.out.12345
callgrind_annotate callgrind.out.12345
callgrind_annotate --inclusive=yes callgrind.out.12345 | head -80
Reading output: callgrind_annotate shows instructions retired (Ir) per function — the top entries are usual suspects.
Cachegrind: cache misses
valgrind --tool=cachegrind ./myapp
# Example lines:
# ==12345== D1 misses: 12,345,678 ( 10.2% of all refs)
# ==12345== LL misses: 1,234,567 ( 1.0% of all refs)
Interpretation: high D1 misses (L1 data) or LL misses (last-level → DRAM) mean you should improve locality or layout.
Memcheck: leaks and memory errors
valgrind --tool=memcheck --leak-check=full ./myapp
valgrind --tool=memcheck --leak-check=full --log-file=memcheck.log ./myapp
Categories: definitely lost (must fix), indirectly lost, possibly lost, still reachable (often optional). Reports include file and line.
5. Intel VTune: CPU pipeline analysis
Install VTune (Linux)
# VTune ships with Intel oneAPI: download it from Intel's site, or add
# Intel's oneAPI apt repository and then: sudo apt install intel-oneapi-vtune
# Load the environment before use: source /opt/intel/oneapi/setvars.sh
VTune from the CLI
vtune -collect hotspots -result-dir vtune_result -- ./myapp
vtune -collect uarch-exploration -result-dir vtune_cache -- ./myapp
vtune -collect memory-access -result-dir vtune_mem -- ./myapp
vtune -report summary -result-dir vtune_result
vtune -report hotspots -result-dir vtune_result
Sample VTune-style summary
Hotspots by CPU Time:
Function CPU Time Module
processData() 45.2% myapp
loadFile() 28.1% myapp
parseJson() 12.3% myapp
Top Micro-architectural Issues:
- L1 Data Cache Misses: 15.2% ← improve access patterns
- Branch Mispredictions: 3.1%
6. Tracy: real-time profiler
What Tracy is
Tracy targets games and real-time apps: insert zones in code, connect the Tracy UI while running, and inspect per-frame latency on a timeline.
Tracy with CMake
include(FetchContent)
FetchContent_Declare(
tracy
GIT_REPOSITORY https://github.com/wolfpld/tracy.git
GIT_TAG v0.10
)
FetchContent_MakeAvailable(tracy)
add_executable(myapp main.cpp)
target_link_libraries(myapp PRIVATE Tracy::TracyClient)
target_compile_definitions(myapp PRIVATE TRACY_ENABLE=1)
Zones and frames
#include <tracy/Tracy.hpp>
#include <string>
#include <vector>
void processData(std::vector<int>& data) {
    ZoneScoped;  // zone named after the enclosing function
    for (size_t i = 0; i < data.size(); ++i) {
        // Note: a zone per iteration is expensive; fine for coarse loops,
        // but avoid it in tight inner loops with millions of iterations.
        ZoneScopedN("ProcessItem");
        data[i] = data[i] * 2 + 1;
    }
}
void loadFile(const std::string& path) {
ZoneScopedN("LoadFile");
}
int main() {
while (running) {
FrameMark;
{ ZoneScopedN("Update"); update(); }
{ ZoneScopedN("Physics"); physicsStep(); }
{ ZoneScopedN("Render"); render(); }
}
}
Note: call FrameMark every frame so the UI can separate frames.
Running Tracy
# Download Tracy profiler from GitHub releases
# Run app built with TRACY_ENABLE=1, open profiler, click Connect
# Default: 127.0.0.1:8086
7. Full benchmark example
Target C++ program
// profile_target.cpp — sample workload for perf, VTune, Tracy
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <iostream>
#ifdef TRACY_ENABLE
#include <tracy/Tracy.hpp>
#endif
// Intentionally cache-unfriendly stride access
void processDataCacheUnfriendly(std::vector<int>& data) {
#ifdef TRACY_ENABLE
ZoneScopedN("ProcessCacheUnfriendly");
#endif
const size_t stride = 16;
for (size_t i = 0; i < data.size(); i += stride) {
data[i] = data[i] * 2 + 1;
}
}
void processDataCacheFriendly(std::vector<int>& data) {
#ifdef TRACY_ENABLE
ZoneScopedN("ProcessCacheFriendly");
#endif
for (size_t i = 0; i < data.size(); ++i) {
data[i] = data[i] * 2 + 1;
}
}
void sortData(std::vector<int>& data) {
#ifdef TRACY_ENABLE
ZoneScopedN("SortData");
#endif
std::sort(data.begin(), data.end());
}
void fillRandom(std::vector<int>& data) {
#ifdef TRACY_ENABLE
ZoneScopedN("FillRandom");
#endif
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(1, 1000000);
for (auto& v : data) {
v = dis(gen);
}
}
int main() {
    const size_t N = 10'000'000;
    std::vector<int> data(N);
    // Each helper already opens its own Tracy zone, so no extra
    // wrapper zones are needed here.
    fillRandom(data);
    sortData(data);
    processDataCacheUnfriendly(data);
    processDataCacheFriendly(data);
#ifdef TRACY_ENABLE
    FrameMark;
#endif
    return 0;
}
Build and run
g++ -std=c++17 -O2 -g -o profile_target profile_target.cpp
g++ -std=c++17 -O2 -pg -g -o profile_target_gprof profile_target.cpp
./profile_target_gprof
gprof profile_target_gprof gmon.out
g++ -std=c++17 -O0 -g -o profile_target_valgrind profile_target.cpp
valgrind --tool=callgrind ./profile_target_valgrind
valgrind --tool=cachegrind ./profile_target_valgrind
valgrind --tool=memcheck --leak-check=full ./profile_target_valgrind
cmake -B build -DCMAKE_BUILD_TYPE=Release -DTRACY_ENABLE=ON
cmake --build build
perf record -F 99 -g ./profile_target
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
8. How to read flame graphs
Structure
flowchart TB
subgraph Flame["Flame graph (width = CPU share)"]
direction TB
M[main - 100%]
M --> F[fillRandom - 35%]
M --> S[sortData - 45%]
M --> P[processData - 20%]
S --> S1[std::sort - 40%]
S --> S2[comparator - 5%]
P --> P1[loop - 18%]
P --> P2[other - 2%]
end
How to read:
- Width: fraction of sampled CPU time; wider means hotter.
- Vertical stack: caller below, callee above (main → sortData → std::sort).
- Wide bars: optimize these first.
Common patterns
| Pattern | Meaning | Mitigation |
|---|---|---|
| Wide memcpy | Copy-bound | Pools, zero-copy |
| Wide malloc/free | Allocation cost | Arenas, pools |
| Wide std::sort | Sort cost | Avoid full sort, partial sort |
| Wide pthread_mutex_lock | Lock wait | Less locking, lock-free where safe |
Full flame graph command sequence
git clone --depth 1 https://github.com/brendangregg/FlameGraph
export PATH="$PATH:$(pwd)/FlameGraph"
perf record -F 99 -g --call-graph dwarf,8192 ./profile_target
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
open flamegraph.svg
9. Common issues and fixes
Issue 1: ??? symbols in perf report
Cause: Missing debug symbols or failed stack unwinding.
# Rebuild with debug info, then record with DWARF unwinding
g++ -std=c++17 -O2 -g -o myapp main.cpp
perf record -F 99 --call-graph dwarf,8192 ./myapp
perf report -v
// If a hot function still vanishes into its caller, block inlining:
__attribute__((noinline)) void criticalPath() {
    // hot code stays attributable even at -O2
}
Issue 2: perf “Permission denied”
# Check the current restriction level (-1 means unrestricted)
cat /proc/sys/kernel/perf_event_paranoid
# Relax it for this boot...
sudo sysctl -w kernel.perf_event_paranoid=-1
# ...and persist across reboots
echo "kernel.perf_event_paranoid = -1" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
Issue 3: VTune “Unable to attach”
# Allow ptrace attach to non-child processes
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
# Load VTune's sampling driver if hardware events are missing
sudo modprobe sep
# Make sure the oneAPI environment is loaded
source /opt/intel/oneapi/setvars.sh
Issue 4: Tracy will not connect
# 1. Confirm the app was built with TRACY_ENABLE defined; without it,
#    every zone compiles to nothing and there is nothing to connect to.
# 2. Check that the client is listening (port 8086 by default)
netstat -an | grep 8086
# 3. Open the port if a firewall is in the way (TCP for data, UDP for discovery)
sudo ufw allow 8086
Issue 5: perf stat “Events not found”
# Some events are unavailable in VMs or on certain CPUs; list what exists
perf list
# Then request only supported events
perf stat -e cycles,instructions,cache-misses ./myapp
Issue 6: Program runs 10× slower while profiling
Use shorter Valgrind runs; lower perf frequency: perf record -F 49 -g ./myapp.
Issue 7: No gmon.out from gprof
Ensure -pg on compile and link; exit cleanly (return 0 / exit(0)).
Issue 8: Memcheck “Invalid read/write”
Initialize memory — use int buffer[100]{} or std::vector<int>(100, 0).
Issue 9: Only “still reachable” from Memcheck
See the table in section 4 — definitely lost is the urgent class.
Issue 10: Empty flame graph
perf record -F 99 -g --call-graph dwarf,8192 ./myapp
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > out.svg
10. Profiling benchmark comparison
Cache-friendly vs unfriendly (same example, N = 10M)
| Function | Time | Miss rate | IPC |
|---|---|---|---|
| processDataCacheFriendly | 12 ms | 2.1% | 2.8 |
| processDataCacheUnfriendly | 89 ms | 18.3% | 0.4 |
The stride version touches only one int per cache line (1/16 of the elements), so it does less arithmetic yet still runs 7×+ slower: nearly every access it makes is a cache miss.
Tool overhead (illustrative)
| Tool | Config | Overhead | Slowdown factor |
|---|---|---|---|
| None | — | 0% | 1.00× |
| gprof -pg | — | 5–15% | ~1.05–1.15× |
| perf -F 99 | default | ~2% | ~1.02× |
| perf -F 999 | high rate | ~8% | ~1.08× |
| VTune hotspots | default | ~10% | ~1.10× |
| Tracy zones | default | ~0.5% | ~1.005× |
| Valgrind callgrind | — | very high | ~10–50× |
| Valgrind cachegrind | — | very high | ~5–20× |
Sampling math
samples ≈ runtime_seconds × Hz
Example: 10 s at 99 Hz → ~990 samples
If a function is ~50% of time → ~495 samples — often enough
Short runs (<1 s): consider 999 Hz
Long runs (>10 s): 99 Hz is often fine
11. Profiling best practices
1. Measure, do not guess
❌ “This function feels slow” → optimize immediately
✅ Use perf/gprof to list top functions, then optimize real hotspots
2. Establish a baseline
perf stat -r 5 ./myapp
time ./myapp
3. Change one thing at a time
Multiple simultaneous edits make attribution impossible.
4. Right tool for the job
| Goal | Prefer | Avoid |
|---|---|---|
| CPU hotspots | perf + flame graphs | Valgrind for CPU |
| Leaks | Memcheck | perf for leaks |
| Cache behavior | Cachegrind / perf stat | gprof for cache |
| Frame latency | Tracy | perf alone |
| Legacy systems | gprof | — |
5. Separate profile and release builds
#ifdef TRACY_ENABLE
ZoneScopedN("CriticalSection");
#endif
6. Enough samples
Short runs may need higher frequency or longer duration.
7. Control the environment
sudo cpupower frequency-set -g performance
12. Production profiling patterns
Pattern 1: perf sampling in production
flowchart LR
A[Prod server] --> B[perf record -F 49]
B --> C[Collect ~30s]
C --> D[Save perf.data]
D --> E[Copy to dev machine]
E --> F[perf report / flame graph]
# System-wide sample for 30 s (note -a; without it only `sleep` is profiled)
perf record -a -F 49 -g -o /tmp/perf.data -- sleep 30 &
# Or target one process by PID
perf record -F 49 -g -p $(pgrep myapp) -o /tmp/perf.data -- sleep 30
# Analyze offline on a dev machine
scp server:/tmp/perf.data .
perf report -i perf.data
Pattern 2: Scheduled profiling
#!/bin/bash
OUT_DIR="/var/log/profiles"
mkdir -p "$OUT_DIR"
DATE=$(date +%Y%m%d_%H%M%S)
PID=$(pgrep -f myapp | head -1)
if [ -n "$PID" ]; then
perf record -F 49 -g -p "$PID" -o "$OUT_DIR/perf_$DATE.data" -- sleep 60
fi
# crontab entry: run the profiling script daily at 03:00
0 3 * * * /opt/scripts/profile_production.sh
Pattern 3: Conditional Tracy
#ifdef TRACY_ENABLE
#define PROFILE_SCOPE(name) ZoneScopedN(name)
#define PROFILE_FRAME() FrameMark
#else
#define PROFILE_SCOPE(name) ((void)0)
#define PROFILE_FRAME() ((void)0)
#endif
Pattern 4: Baseline microbenchmark
#include <chrono>
#include <iostream>
void runWorkload();  // the code path under measurement (defined elsewhere)
int main() {
    const int iterations = 100;
    // steady_clock is monotonic, so it cannot jump backwards mid-measurement
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        runWorkload();
    }
    auto end = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "Baseline: " << (ms / double(iterations)) << " ms/iter\n";
    return 0;
}
Pattern 5: Valgrind on short workloads / tests
valgrind --tool=memcheck --leak-check=full ./run_tests
# CI: --error-exitcode=1
Production checklist:
- perf at 49–99 Hz (≈1–5% overhead)
- Tracy enabled only in dev/staging (TRACY_ENABLE=1)
- VTune on staging (higher overhead)
- No Valgrind in production (massive slowdown)
- Watch disk: perf.data can be hundreds of MB
13. Checklists
perf
- Build with -g
- perf record -F 99 -g or --call-graph dwarf
- Check perf_event_paranoid
- Generate flame graphs with FlameGraph scripts
- Use perf stat for IPC and cache metrics
gprof
- Compile and link with -pg -g
- Exit cleanly so gmon.out is written
- gprof -p for the flat profile, gprof -q for the call graph
- Know the limits around inlining and .so internals
Valgrind
- Debug symbols for line numbers
- Callgrind + KCachegrind when needed
- Cachegrind for miss rates
- Memcheck with --leak-check=full
- Keep runs short
VTune
- Intel CPU environment
- oneAPI / VTune installed, setvars.sh sourced
- hotspots → uarch-exploration → memory-access as needed
- ptrace_scope if attach fails
Tracy
- CMake FetchContent for Tracy
- ZoneScoped / ZoneScopedN / FrameMark in code
- Build with TRACY_ENABLE=1 when profiling
- Disable in production builds when appropriate
Workflow
- perf for a quick CPU picture
- Flame graph for visualization
- Memcheck if leaks are suspected
- Cachegrind or perf stat if cache is suspected
- VTune for deep microarchitectural analysis
- Tracy for a frame-level real-time view
- Re-measure after each change
Summary
| Item | Role |
|---|---|
| perf | Linux standard, flame graphs, low overhead, production sampling |
| gprof | Flat profile and call graph with -pg, legacy environments |
| Valgrind | Callgrind (CPU), Cachegrind (cache), Memcheck (memory); very slow |
| VTune | Deep analysis on Intel CPUs |
| Tracy | Real-time frame profiling for games and interactive apps |
| Flame graphs | Width = share of time; wide = optimize first |
| Production | perf -F 49~99, Tracy off or staging-only, periodic sampling |
Principles:
- Measure before optimizing.
- Start with perf; add VTune or Tracy when needed.
- Use flame graphs to pick the widest bars first.
- In production, prefer low-frequency sampling and no Valgrind.
FAQ
When do I use this in practice?
For the first step of performance work: bottlenecks, CPU/memory behavior, cache misses, and multithreaded contention. Follow the examples and selection guides above.
Which tool should I pick?
Linux server CPU: perf. Memory leaks: Memcheck. Cache simulation: Cachegrind. Deep Intel analysis: VTune. Games / real-time: Tracy. Legacy / simple: gprof.
Is production profiling OK?
perf is often acceptable at 1–5% overhead. Prefer VTune and Tracy in dev/staging. See production patterns above.
Where can I read more?
See the related posts listed at the end of this article.
One-line summary: Use perf, gprof, Valgrind, VTune, and Tracy to find bottlenecks, visualize them with flame graphs, analyze memory and cache behavior, and sample safely in production.
Related posts
- C++ SIMD optimization (SSE/AVX2/NEON) [#51-2]
- C++ cache optimization guide
- C++ thread pool guide [#51-3]
- C++ profiling basics
- C++ benchmarking
- Stack vs heap in C++
- C++ memory leaks
- C++ Valgrind guide