[2026] C++ Performance Optimization | Practical Techniques to Run “10× Faster” [#42-1]

Key takeaways

Practical C++ patterns for “10× faster” code: cut unnecessary copies, optimize allocation, lay data out cache-friendly, lean on compiler optimizations and SIMD, and verify every change with a profiler.

1. Remove unnecessary copies

Pass by value vs by reference

Example process implementations:

// ❌ Slow (copy)
void process(vector<int> data) {
    // ...
}

// ✅ Fast (reference)
void process(const vector<int>& data) {
    // ...
}

// ✅ When mutation is required
void process(vector<int>& data) {
    // ...
}

Move semantics

C++ example:

// ✅ Return by value — no deep copy in practice
vector<int> createLargeVector() {
    vector<int> v(1000000);
    return v;  // NRVO usually elides the copy; moved at worst (C++11+)
}

// ✅ Moved (or elided) on assignment
vector<int> result = createLargeVector();

// ✅ Explicit move (std::move, from <utility>)
vector<int> v1 = {1, 2, 3};
vector<int> v2 = move(v1);  // v1 is left valid but unspecified (typically empty)

2. Memory allocation optimization

Use reserve to avoid repeated reallocations

// ❌ Many reallocations
vector<int> v;
for (int i = 0; i < 1000; i++) {
    v.push_back(i);  // reallocates multiple times
}

// ✅ Allocate once
vector<int> v;
v.reserve(1000);  // reserve up front
for (int i = 0; i < 1000; i++) {
    v.push_back(i);
}

Object pool

template <typename T>
class ObjectPool {
private:
    vector<unique_ptr<T>> pool;  // idle objects, owned by the pool

public:
    // Reuse an idle object if one exists, otherwise allocate.
    // The caller owns the object until it calls release().
    T* acquire() {
        if (pool.empty()) {
            return new T();
        }
        T* obj = pool.back().release();
        pool.pop_back();
        return obj;
    }

    // Return an object to the pool instead of deleting it.
    void release(T* obj) {
        pool.push_back(unique_ptr<T>(obj));
    }
};

Everyday analogy: think of memory like an apartment building. The stack is like an elevator—fast but limited. The heap is like a warehouse—spacious but takes longer to “fetch” things. A pointer is a slip of paper with an address, e.g. “floor 3, unit 302.”
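To make the analogy concrete, here is a small sketch contrasting the two kinds of storage (the function and sizes are invented for illustration):

```cpp
#include <memory>

// Stack = the "elevator": allocation is a pointer bump, freed automatically
// when the function returns. Heap = the "warehouse": an allocator call,
// freed when its owner (here, unique_ptr) goes away.
int sum_ends(int n) {
    int on_stack[64];                            // lives in this stack frame
    auto on_heap = std::make_unique<int[]>(64);  // lives until on_heap dies
    for (int i = 0; i < n; i++) {
        on_stack[i] = i;
        on_heap[i]  = i;
    }
    int* addr = &on_stack[0];  // a pointer: the "address slip" from the analogy
    return *addr + on_stack[n - 1] + on_heap[n - 1];
}
```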

3. Cache-friendly code

Data locality

// ❌ Many cache misses
struct Bad {
    int id;
    char padding[60];  // wastes cache line
    int value;
};

// ✅ Cache-friendly
struct Good {
    int id;
    int value;
    // keep related fields together
};

Traversing a matrix

int matrix[1000][1000];

// ❌ Slow (poor locality for row-major layout)
for (int j = 0; j < 1000; j++) {
    for (int i = 0; i < 1000; i++) {
        matrix[i][j] = 0;
    }
}

// ✅ Fast (sequential access)
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 1000; j++) {
        matrix[i][j] = 0;
    }
}

4. Compiler optimizations

Inline functions

Example add:

// ❌ Function call overhead (may still be inlined at -O2)
int add(int a, int b) {
    return a + b;
}

// ✅ inline hint
inline int add(int a, int b) {
    return a + b;
}

// ✅ constexpr (compile-time when possible)
constexpr int add(int a, int b) {
    return a + b;
}

Compiler flags

Run in the terminal:

# Optimization levels
g++ -O0  # no optimization
g++ -O1  # basic
g++ -O2  # commonly recommended
g++ -O3  # aggressive

# Extras
g++ -O3 -march=native  # tune for local CPU
g++ -O3 -flto          # link-time optimization

Hands-on examples

Example 1: String concatenation

#include <iostream>
#include <string>
#include <sstream>
#include <chrono>
using namespace std;

// ❌ Slow
string concat1(int n) {
    string result;
    for (int i = 0; i < n; i++) {
        result += to_string(i);  // reallocates often
    }
    return result;
}

// ✅ Faster
string concat2(int n) {
    ostringstream oss;
    for (int i = 0; i < n; i++) {
        oss << i;
    }
    return oss.str();
}

int main() {
    auto start = chrono::high_resolution_clock::now();
    concat1(10000);
    auto end = chrono::high_resolution_clock::now();
    cout << "concat1: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;

    start = chrono::high_resolution_clock::now();
    concat2(10000);
    end = chrono::high_resolution_clock::now();
    cout << "concat2: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
}

Note: ostringstream is often much faster than repeated string += for many appends.
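A third option, not shown above, is to keep using string += but reserve capacity up front, as in section 2. A sketch (the capacity estimate is a rough guess for illustration):

```cpp
#include <string>
using namespace std;

// ✅ Also fast: one large allocation, then append in place
string concat3(int n) {
    string result;
    result.reserve(n * 7);  // rough upper bound on digits per number (assumption)
    for (int i = 0; i < n; i++) {
        result += to_string(i);
    }
    return result;
}
```

This produces the same string as concat1 but avoids most of the reallocation cost, since the buffer rarely has to grow.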

Example 2: Lookup table

#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;

// ❌ Slow (recomputes every time)
double slow(int x) {
    return sin(x * 0.01);
}

// ✅ Fast (precomputed)
class FastSin {
private:
    static constexpr int SIZE = 360;
    double table[SIZE];

public:
    FastSin() {
        for (int i = 0; i < SIZE; i++) {
            table[i] = sin(i * 0.01);  // precompute for x in [0, SIZE)
        }
    }

    double get(int x) {
        // Exact only for 0 <= x < SIZE: the % wrap is an approximation,
        // since sin's true period is 2π, not SIZE * 0.01.
        return table[x % SIZE];
    }
};

int main() {
    FastSin fastSin;

    auto start = chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000; i++) {
        slow(i);
    }
    auto end = chrono::high_resolution_clock::now();
    cout << "slow: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;

    start = chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000; i++) {
        fastSin.get(i);
    }
    end = chrono::high_resolution_clock::now();
    cout << "fast: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
}

Note: repetitive math can often be replaced with a lookup table (mind accuracy and memory).

Example 3: SIMD optimization

#include <immintrin.h>  // AVX
#include <iostream>
using namespace std;

// ❌ Scalar loop
void add_scalar(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

// ✅ SIMD (process 8 floats at a time; assumes n multiple of 8 in this sketch)
void add_simd(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
}

Note: SIMD processes multiple lanes in parallel; add a scalar tail loop for general n.

Profiling tools

1. gprof

# Compile
g++ -pg program.cpp -o program

# Run
./program

# Inspect profile
gprof program gmon.out > analysis.txt

2. Valgrind (Callgrind)

# Profile
valgrind --tool=callgrind ./program

# View results
kcachegrind callgrind.out.*

3. perf (Linux)

# Profile
perf record ./program

# Report
perf report

Optimization checklist

1. Algorithm

  • Check time complexity (e.g. O(n²) → O(n log n))
  • Remove redundant computation
  • Pick the right data structure

2. Memory

  • Use reserve() where size is known
  • Remove unnecessary copies
  • Apply move semantics

3. Compiler

  • Use -O2 or -O3 in release builds
  • Use inline / constexpr where appropriate
  • Consider LTO

4. Cache

  • Improve locality
  • Prefer sequential access patterns
  • Minimize struct padding where it matters

5. Parallelism

  • Consider multithreading
  • Use SIMD where applicable
  • GPU (CUDA, OpenCL) when the problem fits

Common mistakes

Mistake 1: Premature optimization

// ❌ Hard to read “clever” code
int x = (a << 1) + (b >> 2);

// ✅ Clear code (compiler optimizes well)
int x = a * 2 + b / 4;

Mistake 2: Optimizing without profiling

1. Find bottlenecks with a profiler
2. Optimize only those hotspots
3. Confirm with another profile run

Mistake 3: Micro-optimizing before the big wins

Algorithm improvements > data structure choice > line-level tweaks
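That ordering can be illustrated with a common case: membership tests. Swapping a linear scan for a hash set is an algorithmic win no line-level tweak can match. A sketch (function names are ours):

```cpp
#include <unordered_set>
#include <vector>
using namespace std;

// ❌ O(n·m): rescans the whole vector for every query
int count_hits_slow(const vector<int>& data, const vector<int>& queries) {
    int hits = 0;
    for (int q : queries)
        for (int d : data)
            if (d == q) { hits++; break; }
    return hits;
}

// ✅ O(n + m) average: build a hash set once, then O(1) lookups
int count_hits_fast(const vector<int>& data, const vector<int>& queries) {
    unordered_set<int> s(data.begin(), data.end());
    int hits = 0;
    for (int q : queries)
        if (s.count(q)) hits++;
    return hits;
}
```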

FAQ

Q1: When should I optimize?

A

  1. Confirm hotspots with profiling
  2. Verify the hotspot matters for your SLO/users
  3. Measure again after changes

Q2: What optimization pays off most?

A: Improving algorithms—asymptotics dominate (e.g. O(n²) → O(n log n)).

Q3: Can I trust the compiler?

A: Yes for most local optimizations; still measure hot paths.

Q4: Performance vs readability?

A: Prefer readability; optimize proven bottlenecks.

Q5: Which profiling tools should I use?

A

  • Linux: perf, Valgrind
  • Windows: Visual Studio Profiler
  • Cross-platform: Tracy Profiler

Q6: Learning resources?

A

  • C++ profiling — find bottlenecks with perf and gprof
  • C++ cache optimization
  • C++ alignment and padding
  • C++ benchmarking
  • C++ algorithm sort
  • C++ string vs string_view