C++ Performance Optimization: Copies, Allocations, Cache, and SIMD

Key takeaways

Practical C++ performance techniques: avoid needless copies, reduce allocations, keep data cache-friendly, and use SIMD where it pays off.

1. Avoid unnecessary copies

Pass by value vs reference

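// By value: an lvalue argument is copied on every call.
void process_copy(vector<int> data) { }

// By const reference: no copy, read-only access. Prefer this for inputs.
void process_read(const vector<int>& data) { }

// By non-const reference: no copy; the callee may modify the caller's vector.
void process_modify(vector<int>& data) { }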
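To make the cost of pass-by-value visible, the following sketch counts element copies. `Tracked`, `g_copies`, `byValue`, `byConstRef`, and `copiesMade` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

int g_copies = 0;  // hypothetical global counter, for illustration only

// Hypothetical element type that records every copy construction.
struct Tracked {
    Tracked() = default;
    Tracked(const Tracked&) { ++g_copies; }
};

// Receives its own copy of the vector.
std::size_t byValue(std::vector<Tracked> data) { return data.size(); }

// Reads the caller's vector in place.
std::size_t byConstRef(const std::vector<Tracked>& data) { return data.size(); }

// Returns how many element copies one call makes for a vector of n elements.
int copiesMade(bool passByValue, std::size_t n) {
    std::vector<Tracked> v(n);   // n default-constructed elements
    g_copies = 0;                // ignore anything that happened during setup
    const std::size_t seen = passByValue ? byValue(v) : byConstRef(v);
    assert(seen == n);
    return g_copies;
}
```

Calling `copiesMade(true, 100)` should report one copy per element, while `copiesMade(false, 100)` should report none.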

Move semantics

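vector<int> createLargeVector() {
    vector<int> v(1000000);
    return v;  // no copy: NRVO or an implicit move; do not write std::move(v) here
}

vector<int> v1 = {1, 2, 3};
vector<int> v2 = std::move(v1);  // v2 takes over v1's buffer; v1 is left empty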
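A quick way to see that a move transfers the buffer instead of copying elements is to compare the data pointer before and after. This is a minimal sketch; `moveReusesBuffer` is a hypothetical helper name.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Moves `src` into a new vector and reports whether the heap buffer was reused.
bool moveReusesBuffer(std::vector<int>& src) {
    const int* before = src.data();          // address of the original buffer
    std::vector<int> dst = std::move(src);   // move construction, constant time
    assert(src.empty());                     // the source is empty after move construction
    return dst.data() == before;             // same buffer, so no element was copied
}
```

After the call, the source vector is still a valid object and can be reassigned or refilled.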

2. Memory allocation

reserve to reduce reallocations

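vector<int> v;
v.reserve(1000);  // one allocation up front instead of repeated growth
for (int i = 0; i < 1000; i++) {
    v.push_back(i);  // never reallocates: capacity is already sufficient
}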
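The effect of reserve can be checked by counting capacity changes during the loop. The sketch below assumes nothing beyond the standard library; `countReallocations` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often push_back had to grow the buffer while appending n ints.
// If reserveFirst is true, capacity for all n elements is requested up front.
int countReallocations(std::size_t n, bool reserveFirst) {
    std::vector<int> v;
    if (reserveFirst) v.reserve(n);
    int reallocations = 0;
    std::size_t lastCapacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != lastCapacity) {   // capacity changed, so the buffer was reallocated
            ++reallocations;
            lastCapacity = v.capacity();
        }
    }
    assert(v.size() == n);
    return reallocations;
}
```

With reserve the count is zero; without it the exact number depends on the standard library's growth factor.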

Object pool (sketch)

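// Sketch only: not thread-safe, and reused objects keep their previous state.
template <typename T>
class ObjectPool {
private:
    vector<unique_ptr<T>> pool;  // objects waiting to be reused

public:
    // Returns a raw pointer the caller owns until it is passed to release().
    T* acquire() {
        if (pool.empty()) {
            return new T();  // pool exhausted: fall back to a fresh allocation
        }
        T* obj = pool.back().release();  // take ownership out of the unique_ptr
        pool.pop_back();
        return obj;
    }

    // Hands the object back; a pointer that is never released leaks.
    void release(T* obj) {
        pool.push_back(unique_ptr<T>(obj));
    }
};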
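The raw pointers in the sketch above make leaks easy. One possible alternative, shown here as an assumption and not as the original design, passes ownership explicitly with `std::unique_ptr`. `SafePool` and `available` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical pool variant that never exposes raw owning pointers.
template <typename T>
class SafePool {
    std::vector<std::unique_ptr<T>> free_;   // objects waiting to be reused

public:
    // Hands out a pooled object if one is available, otherwise allocates.
    std::unique_ptr<T> acquire() {
        if (free_.empty()) {
            return std::make_unique<T>();
        }
        std::unique_ptr<T> obj = std::move(free_.back());
        free_.pop_back();
        return obj;
    }

    // Returns an object to the pool; ownership moves back into the pool.
    void release(std::unique_ptr<T> obj) {
        assert(obj != nullptr);
        free_.push_back(std::move(obj));
    }

    // Number of objects currently available for reuse.
    std::size_t available() const { return free_.size(); }
};
```

If a caller forgets to call `release`, the object is simply destroyed when its `unique_ptr` goes out of scope, so nothing leaks; the pool just misses a reuse opportunity.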

3. Cache-friendly code

Data locality

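// id and value are used together but sit 64 bytes apart,
// so with 64-byte cache lines they never share a line.
struct Bad {
    int id;
    char padding[60];
    int value;
};

// Hot fields are adjacent: one cache line holds several whole objects.
struct Good {
    int id;
    int value;
};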
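The layout difference can be checked with `offsetof`. This sketch assumes a 4-byte `int` and 64-byte cache lines, both of which are common but not guaranteed; `sameCacheLine` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>   // offsetof, std::size_t

static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");

struct Bad {
    int id;
    char padding[60];
    int value;
};

struct Good {
    int id;
    int value;
};

// True when two byte offsets fall in the same cache line, assuming the object
// starts on a line boundary and lines are lineSize bytes (64 is an assumption).
bool sameCacheLine(std::size_t a, std::size_t b, std::size_t lineSize = 64) {
    assert(lineSize > 0);
    return a / lineSize == b / lineSize;
}
```

For example, `sameCacheLine(offsetof(Bad, id), offsetof(Bad, value))` is false under these assumptions, while the same check on `Good` is true.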

Matrix traversal

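// About 4 MB: declare at namespace scope or as static, not as a local variable.
int matrix[1000][1000];

// Row-major order: the inner loop over j touches adjacent memory.
// Swapping the loops would jump a full row (1000 ints) on every step.
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 1000; j++) {
        matrix[i][j] = 0;
    }
}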
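To compare traversal orders on your own machine, the two loops can be written side by side. This sketch stores the matrix in one contiguous vector; `sumRowMajor` and `sumColumnMajor` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums a rows x cols matrix stored row-major in one contiguous vector.
// The inner loop walks adjacent elements (stride 1).
long long sumRowMajor(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            total += m[i * cols + j];
    return total;
}

// Same result, but the inner loop jumps cols elements on every step.
long long sumColumnMajor(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    long long total = 0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            total += m[i * cols + j];
    return total;
}
```

Both functions return the same sum; only the memory access pattern differs, which is what a benchmark on a large matrix would expose.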

4. Compiler optimizations

Inline / constexpr

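// inline mainly lets the definition live in a header; the optimizer
// decides by itself whether to inline a call.
inline int add(int a, int b) {
    return a + b;
}

// constexpr: can be evaluated at compile time when the arguments are constants.
constexpr int add_ct(int a, int b) {
    return a + b;
}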
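A short sketch of what constexpr allows in practice: the same function is checked at compile time, used to size an array, and called with run-time values. `buffer` and `addAtRuntime` are hypothetical names.

```cpp
#include <cassert>

constexpr int add_ct(int a, int b) {
    return a + b;
}

// Evaluated by the compiler: a wrong result would be a build error.
static_assert(add_ct(2, 3) == 5, "add_ct must be usable at compile time");

// An array bound must be a constant expression, so this proves compile-time evaluation.
int buffer[add_ct(2, 3)];

// The same function also works with values known only at run time.
int addAtRuntime(int a, int b) {
    return add_ct(a, b);
}
```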

Flags

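g++ -O0                  # no optimization; fastest builds, easiest debugging
g++ -O1                  # basic optimizations
g++ -O2                  # common default for release builds
g++ -O3                  # more aggressive optimization; measure, it is not always faster
g++ -O3 -march=native    # also target the build machine's CPU; the binary may not run elsewhere
g++ -O3 -flto            # link-time optimization across translation units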

Examples

Example 1: String concatenation

#include <iostream>
#include <string>
#include <sstream>
#include <chrono>
using namespace std;

string concat1(int n) {
    string result;
    for (int i = 0; i < n; i++) {
        result += to_string(i);
    }
    return result;
}

string concat2(int n) {
    ostringstream oss;
    for (int i = 0; i < n; i++) {
        oss << i;
    }
    return oss.str();
}

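Measure both before choosing. Appending to a std::string with += is amortized constant time per append and is often faster than ostringstream, especially after a reserve call; ostringstream is mainly convenient when mixing types or formatting.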
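A third variant worth including in any comparison reserves capacity before appending. This is a sketch under an assumption: `bytesPerItem` is a hypothetical tuning parameter, a rough guess at the average number of digits per value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical variant: append in place after reserving capacity once.
// bytesPerItem is only an estimate; a poor guess affects speed, not correctness.
std::string concatReserved(int n, std::size_t bytesPerItem = 6) {
    assert(n >= 0);
    std::string result;
    result.reserve(static_cast<std::size_t>(n) * bytesPerItem);
    for (int i = 0; i < n; i++) {
        result += std::to_string(i);
    }
    return result;
}
```

The output is identical to the other two versions, so the three can be timed against each other directly.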

Example 2: Lookup table

#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;

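// Sine of an angle given in whole degrees, computed on every call.
double slow(int deg) {
    return sin(deg * 3.14159265358979323846 / 180.0);
}

// Precomputed table with one entry per degree, so wrapping at 360
// matches the period of the function.
class FastSin {
private:
    static constexpr int SIZE = 360;
    double table[SIZE];

public:
    FastSin() {
        for (int i = 0; i < SIZE; i++) {
            table[i] = sin(i * 3.14159265358979323846 / 180.0);
        }
    }

    double get(int deg) const {
        int idx = deg % SIZE;
        if (idx < 0) idx += SIZE;  // % yields a negative result for negative input
        return table[idx];
    }
};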
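A lookup table is only useful if it returns the same values as the function it replaces, so it is worth checking one against the other. The sketch below uses its own hypothetical table with one entry per whole degree, which is an assumption about the intended input; `DegreeSinTable` and `maxTableError` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical table with one entry per whole degree (360 entries).
class DegreeSinTable {
    std::vector<double> table_;

public:
    DegreeSinTable() : table_(360) {
        const double pi = std::acos(-1.0);
        for (int i = 0; i < 360; ++i) {
            table_[i] = std::sin(i * pi / 180.0);
        }
    }

    // Wraps any integer angle into [0, 360) before indexing.
    double get(int deg) const {
        int idx = deg % 360;
        if (idx < 0) idx += 360;
        return table_[idx];
    }
};

// Largest absolute difference between the table and std::sin over [from, to] degrees.
double maxTableError(const DegreeSinTable& t, int from, int to) {
    assert(from <= to);
    const double pi = std::acos(-1.0);
    double worst = 0.0;
    for (int d = from; d <= to; ++d) {
        worst = std::max(worst, std::fabs(t.get(d) - std::sin(d * pi / 180.0)));
    }
    return worst;
}
```

A check such as `maxTableError(table, -720, 720)` covers negative angles and several periods, the two cases where a naive index computation tends to go wrong.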

Example 3: SIMD

#include <immintrin.h>
#include <iostream>
using namespace std;

void add_scalar(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

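// Requires AVX: compile with -mavx or -march=native.
void add_simd(float* a, float* b, float* c, int n) {
    int i = 0;
    // Process 8 floats per iteration while a full vector remains.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);
        _mm256_storeu_ps(&c[i], vc);
    }
    // Scalar tail for the remaining n % 8 elements.
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}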

Profiling tools

gprof

g++ -pg program.cpp -o program
./program
gprof program gmon.out > analysis.txt

Valgrind Callgrind

valgrind --tool=callgrind ./program
kcachegrind callgrind.out.*

perf

perf record ./program
perf report
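For a quick before-and-after check of a single function, a small timing helper built on std::chrono can complement the profilers above. This is a minimal sketch, not a replacement for a benchmarking library; `averageMicros` is a hypothetical name.

```cpp
#include <cassert>
#include <chrono>

// Runs fn `repeats` times and returns the average wall-clock time per run
// in microseconds. steady_clock is used because it never jumps backwards.
template <typename Fn>
double averageMicros(Fn&& fn, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        fn();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

Build with optimizations enabled and make sure the measured work has an observable result, otherwise the compiler may remove it entirely.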

Optimization checklist

Algorithms

  • Complexity (e.g. O(n²) → O(n log n))
  • Remove redundant work
  • Right data structure

Memory

  • reserve where needed
  • Fewer copies
  • Moves where appropriate

Compiler

  • -O2 or -O3
  • inline / constexpr where it helps
  • Consider LTO

Cache

  • Locality
  • Sequential access
  • Struct padding awareness

Parallelism

  • Threads where appropriate
  • SIMD
  • GPU when applicable

Common mistakes

Mistake 1: Premature micro-optimization

// Leave this as written: the compiler already emits shifts where they are valid.
int x = a * 2 + b / 4;
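Hand-written shifts are not just unnecessary, they can change the result. The sketch below compares the two forms; `readable` and `handShifted` are hypothetical names, and the behavior of `>>` on negative values is assumed to be an arithmetic shift, which C++20 guarantees and mainstream compilers have long implemented.

```cpp
#include <cassert>

// The readable form: let the compiler choose the instructions.
int readable(int a, int b) {
    return a * 2 + b / 4;
}

// A hand-tuned form. For negative b, an arithmetic b >> 2 rounds toward
// negative infinity, while b / 4 truncates toward zero, so results differ.
// Only call this with non-negative a: shifting a negative value left is
// undefined before C++20.
int handShifted(int a, int b) {
    assert(a >= 0);
    return (a << 1) + (b >> 2);
}
```

For non-negative inputs the two agree; for `b = -7` they do not, which is exactly the kind of bug a micro-optimization can introduce.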

Mistake 2: Optimizing without profiling

1. Profile
2. Optimize hotspots
3. Profile again

Mistake 3: Chasing micro-opts before algorithms

Priority order: algorithm first, then data structures, then line-level tweaks.
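As an illustration of why the algorithm comes first, here is a duplicate check written two ways. Both functions are hypothetical examples, not code from a particular library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// O(n^2): compares every pair of elements.
bool hasDuplicateQuadratic(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] == v[j]) return true;
    return false;
}

// O(n log n): sort a private copy, after which equal values are adjacent.
// The parameter is taken by value on purpose because sorting modifies it.
bool hasDuplicateSorted(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}
```

No amount of line-level tuning of the first version will match the second on large inputs, although for very small vectors the quadratic loop can still be competitive.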

FAQ

Q1: When to optimize?

A: After profiling shows a real bottleneck; verify with measurements.

Q2: Biggest wins?

A: Algorithmic improvements (e.g. better asymptotic complexity).

Q3: Trust the compiler?

A: Yes, for most local optimizations, but still measure hot paths.

Q4: Performance vs readability?

A: Prefer readability; optimize proven bottlenecks.

Q5: Profiling tools?

A: Linux: perf and Valgrind; Windows: the Visual Studio Profiler; cross-platform: Tracy.

Q6: Resources?

A: Optimized C++ by Kurt Guntheroth, CppCon talks, Compiler Explorer.


Practical tips

Debugging

  • Fix compiler warnings first
  • Reduce the problem to a small reproduction

Performance

  • Profile before optimizing
  • Define the metrics you are optimizing for

Code review

  • Follow the team's conventions

Checklist

Before coding

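  • Is this the right technique for the problem?
  • Will the result be maintainable?
  • Does it meet the requirements?

While coding

  • Are compiler warnings resolved?
  • Are edge cases handled?
  • Are errors handled?

At review

  • Is the intent clear?
  • Are there tests?
  • Is it documented?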
