C++ Performance Optimization | Practical Techniques to Run “10× Faster”
Key takeaways
Practical C++ performance patterns that add up to large speedups: cut unnecessary copies, optimize allocation, lay out data for the cache, lean on compiler optimizations and SIMD, and always confirm with a profiler.
1. Remove unnecessary copies
Pass by value vs by reference
Example process implementations:
// ❌ Slow (copy)
void process(vector<int> data) {
// ...
}
// ✅ Fast (reference)
void process(const vector<int>& data) {
// ...
}
// ✅ When mutation is required
void process(vector<int>& data) {
// ...
}
Move semantics
C++ example:
// ❌ Looks like a copy...
vector<int> createLargeVector() {
vector<int> v(1000000);
return v; // ...but is usually elided (NRVO) or moved (C++11+); copied only by old compilers
}
}
// ✅ Move
vector<int> result = createLargeVector(); // moved (C++11+)
// Explicit move
vector<int> v1 = {1, 2, 3};
vector<int> v2 = move(v1); // v1 is left in a valid but unspecified (typically empty) state
2. Memory allocation optimization
Use reserve to avoid repeated reallocations
// ❌ Many reallocations
vector<int> v;
for (int i = 0; i < 1000; i++) {
v.push_back(i); // reallocates multiple times
}
// ✅ Allocate once
vector<int> v;
v.reserve(1000); // reserve up front
for (int i = 0; i < 1000; i++) {
v.push_back(i);
}
Object pool
template <typename T>
class ObjectPool {
private:
vector<unique_ptr<T>> pool;
public:
T* acquire() {
if (pool.empty()) {
return new T();
}
T* obj = pool.back().release();
pool.pop_back();
return obj;
}
void release(T* obj) {
pool.push_back(unique_ptr<T>(obj));
}
};
Everyday analogy: think of memory like an apartment building. The stack is like an elevator—fast but limited. The heap is like a warehouse—spacious but takes longer to “fetch” things. A pointer is a slip of paper with an address, e.g. “floor 3, unit 302.”
3. Cache-friendly code
Data locality
// ❌ Many cache misses
struct Bad {
int id;
char padding[60]; // wastes cache line
int value;
};
// ✅ Cache-friendly
struct Good {
int id;
int value;
// keep related fields together
};
Traversing a matrix
int matrix[1000][1000];
// ❌ Slow (poor locality for row-major layout)
for (int j = 0; j < 1000; j++) {
for (int i = 0; i < 1000; i++) {
matrix[i][j] = 0;
}
}
// ✅ Fast (sequential access)
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 1000; j++) {
matrix[i][j] = 0;
}
}
4. Compiler optimizations
Inline functions
Three variants of an add function:
// ❌ Function call overhead (may still be inlined at -O2)
int add(int a, int b) {
return a + b;
}
// ✅ inline hint
inline int add(int a, int b) {
return a + b;
}
// ✅ constexpr (compile-time when possible)
constexpr int add(int a, int b) {
return a + b;
}
Compiler flags
Run in the terminal:
# Optimization levels
g++ -O0 main.cpp # no optimization
g++ -O1 main.cpp # basic
g++ -O2 main.cpp # commonly recommended
g++ -O3 main.cpp # aggressive
# Extras
g++ -O3 -march=native main.cpp # tune for the local CPU
g++ -O3 -flto main.cpp # link-time optimization
Hands-on examples
Example 1: String concatenation
#include <iostream>
#include <string>
#include <sstream>
#include <chrono>
using namespace std;
// ❌ Slow
string concat1(int n) {
string result;
for (int i = 0; i < n; i++) {
result += to_string(i); // reallocates often
}
return result;
}
// ✅ Faster
string concat2(int n) {
ostringstream oss;
for (int i = 0; i < n; i++) {
oss << i;
}
return oss.str();
}
int main() {
auto start = chrono::high_resolution_clock::now();
concat1(10000);
auto end = chrono::high_resolution_clock::now();
cout << "concat1: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
start = chrono::high_resolution_clock::now();
concat2(10000);
end = chrono::high_resolution_clock::now();
cout << "concat2: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
}
Note: for many appends, ostringstream often beats naive string +=, but implementations vary; string::reserve() plus += can be faster still, so measure both.
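A third variant worth benchmarking alongside the two above reserves capacity up front (the size estimate is a rough assumption, not exact):

```cpp
#include <string>
using namespace std;

// ✅ Also worth benchmarking: += with an up-front reserve.
// ~7 chars per number is a rough upper-bound guess for this input range.
string concat3(int n) {
    string result;
    result.reserve(static_cast<size_t>(n) * 7);  // avoid most reallocations
    for (int i = 0; i < n; i++) {
        result += to_string(i);
    }
    return result;
}
```

It produces the same output as concat1 and concat2, so any of the three can be swapped into the benchmark harness above.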
Example 2: Lookup table
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
// ❌ Slow (recomputes every time)
double slow(int x) {
return sin(x * 0.01);
}
// ✅ Fast (precomputed)
class FastSin {
private:
static constexpr int SIZE = 360;
double table[SIZE];
public:
FastSin() {
for (int i = 0; i < SIZE; i++) {
table[i] = sin(i * 0.01);
}
}
double get(int x) {
return table[x % SIZE]; // caution: matches slow(x) only for 0 <= x < SIZE — sin(x * 0.01) has period 2π/0.01 ≈ 628, not 360
}
};
int main() {
FastSin fastSin;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; i++) {
slow(i);
}
auto end = chrono::high_resolution_clock::now();
cout << "slow: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
start = chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; i++) {
fastSin.get(i);
}
end = chrono::high_resolution_clock::now();
cout << "fast: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
}
Note: repetitive math can often be replaced with a lookup table (mind accuracy and memory).
Example 3: SIMD optimization
#include <immintrin.h> // AVX
#include <iostream>
using namespace std;
// ❌ Scalar loop
void add_scalar(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
// ✅ SIMD (process 8 floats at a time; assumes n multiple of 8 in this sketch)
void add_simd(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_storeu_ps(&c[i], vc);
}
}
Note: SIMD processes multiple lanes in parallel; add a scalar tail loop for general n.
Profiling tools
1. gprof
# Compile
g++ -pg program.cpp -o program
# Run
./program
# Inspect profile
gprof program gmon.out > analysis.txt
2. Valgrind (Callgrind)
# Profile
valgrind --tool=callgrind ./program
# View results
kcachegrind callgrind.out.*
3. perf (Linux)
# Profile
perf record ./program
# Report
perf report
Optimization checklist
1. Algorithm
- Check time complexity (e.g. O(n²) → O(n log n))
- Remove redundant computation
- Pick the right data structure
2. Memory
- Use reserve() where size is known
- Remove unnecessary copies
- Apply move semantics
3. Compiler
- Use -O2 or -O3 in release builds
- Use inline/constexpr where appropriate
- Consider LTO
4. Cache
- Improve locality
- Prefer sequential access patterns
- Minimize struct padding where it matters
5. Parallelism
- Consider multithreading
- Use SIMD where applicable
- GPU (CUDA, OpenCL) when the problem fits
Common mistakes
Mistake 1: Premature optimization
// ❌ Hard to read “clever” code
int x = (a << 1) + (b >> 2);
// ✅ Clear code (the compiler does such strength reduction itself; b >> 2 even differs from b / 4 for negative b)
int x = a * 2 + b / 4;
Mistake 2: Optimizing without profiling
1. Find bottlenecks with a profiler
2. Optimize only those hotspots
3. Confirm with another profile run
Mistake 3: Micro-optimizing before the big wins
Algorithm improvements > data structure choice > line-level tweaks
FAQ
Q1: When should I optimize?
A
- Confirm hotspots with profiling
- Verify the hotspot matters for your SLO/users
- Measure again after changes
Q2: What optimization pays off most?
A: Improving algorithms—asymptotics dominate (e.g. O(n²) → O(n log n)).
Q3: Can I trust the compiler?
A: Yes for most local optimizations; still measure hot paths.
Q4: Performance vs readability?
A: Prefer readability; optimize proven bottlenecks.
Q5: Recommended profiling tools?
A
- Linux: perf, Valgrind
- Windows: Visual Studio Profiler
- Cross-platform: Tracy Profiler
Q6: Learning resources?
A
- Optimized C++ by Kurt Guntheroth
- CppCon talks
- Compiler Explorer