C++ Performance Optimization | Practical Techniques to Run “10× Faster”
Key takeaways
Practical C++ performance patterns that add up to large speedups: cut unnecessary copies, optimize allocation, lay out data for the cache, lean on compiler optimizations and SIMD, and always confirm with a profiler.
1. Remove unnecessary copies
Pass by value vs by reference
Example process implementations:
// ❌ Slow (copy)
void process(vector<int> data) {
// ...
}
// ✅ Fast (reference)
void process(const vector<int>& data) {
// ...
}
// ✅ When mutation is required
void process(vector<int>& data) {
// ...
}
Move semantics
C++ example:
// ❌ Looks like a copy...
vector<int> createLargeVector() {
vector<int> v(1000000);
return v; // ...but is usually elided (NRVO) or moved (C++11+); copied only by old compilers
}
}
// ✅ Move
vector<int> result = createLargeVector(); // moved (C++11+)
// Explicit move
vector<int> v1 = {1, 2, 3};
vector<int> v2 = move(v1); // v1 is left in a valid but unspecified (typically empty) state
2. Memory allocation optimization
Use reserve to avoid repeated reallocations
// ❌ Many reallocations
vector<int> v;
for (int i = 0; i < 1000; i++) {
v.push_back(i); // reallocates multiple times
}
// ✅ Allocate once
vector<int> v;
v.reserve(1000); // reserve up front
for (int i = 0; i < 1000; i++) {
v.push_back(i);
}
Object pool
template <typename T>
class ObjectPool {
private:
vector<unique_ptr<T>> pool;
public:
T* acquire() {
if (pool.empty()) {
return new T();
}
T* obj = pool.back().release();
pool.pop_back();
return obj;
}
void release(T* obj) {
pool.push_back(unique_ptr<T>(obj));
}
};
Everyday analogy: think of memory like an apartment building. The stack is like an elevator—fast but limited. The heap is like a warehouse—spacious but takes longer to “fetch” things. A pointer is a slip of paper with an address, e.g. “floor 3, unit 302.”
3. Cache-friendly code
Data locality
// ❌ Many cache misses
struct Bad {
int id;
char padding[60]; // wastes cache line
int value;
};
// ✅ Cache-friendly
struct Good {
int id;
int value;
// keep related fields together
};
Traversing a matrix
int matrix[1000][1000];
// ❌ Slow (poor locality for row-major layout)
for (int j = 0; j < 1000; j++) {
for (int i = 0; i < 1000; i++) {
matrix[i][j] = 0;
}
}
// ✅ Fast (sequential access)
for (int i = 0; i < 1000; i++) {
for (int j = 0; j < 1000; j++) {
matrix[i][j] = 0;
}
}
4. Compiler optimizations
Inline functions
Three variants of an add function:
// ❌ Function call overhead (may still be inlined at -O2)
int add(int a, int b) {
return a + b;
}
// ✅ inline hint
inline int add(int a, int b) {
return a + b;
}
// ✅ constexpr (compile-time when possible)
constexpr int add(int a, int b) {
return a + b;
}
Compiler flags
Run in the terminal:
# Optimization levels
g++ -O0 main.cpp # no optimization
g++ -O1 main.cpp # basic
g++ -O2 main.cpp # commonly recommended
g++ -O3 main.cpp # aggressive
# Extras
g++ -O3 -march=native main.cpp # tune for the local CPU
g++ -O3 -flto main.cpp # link-time optimization
Hands-on examples
Example 1: String concatenation
#include <iostream>
#include <string>
#include <sstream>
#include <chrono>
using namespace std;
// ❌ Slow
string concat1(int n) {
string result;
for (int i = 0; i < n; i++) {
result += to_string(i); // reallocates often
}
return result;
}
// ✅ Faster
string concat2(int n) {
ostringstream oss;
for (int i = 0; i < n; i++) {
oss << i;
}
return oss.str();
}
int main() {
auto start = chrono::high_resolution_clock::now();
concat1(10000);
auto end = chrono::high_resolution_clock::now();
cout << "concat1: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
start = chrono::high_resolution_clock::now();
concat2(10000);
end = chrono::high_resolution_clock::now();
cout << "concat2: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
}
Note: for many appends, ostringstream often beats naive string +=, but implementations vary; string::reserve() plus += can be faster still, so measure both.
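A third variant worth benchmarking alongside the two above reserves capacity up front (the size estimate is a rough assumption, not exact):

```cpp
#include <string>
using namespace std;

// ✅ Also worth benchmarking: += with an up-front reserve.
// ~7 chars per number is a rough upper-bound guess for this input range.
string concat3(int n) {
    string result;
    result.reserve(static_cast<size_t>(n) * 7);  // avoid most reallocations
    for (int i = 0; i < n; i++) {
        result += to_string(i);
    }
    return result;
}
```

It produces the same output as concat1 and concat2, so any of the three can be swapped into the benchmark harness above.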
Example 2: Lookup table
#include <iostream>
#include <cmath>
#include <chrono>
using namespace std;
// ❌ Slow (recomputes every time)
double slow(int x) {
return sin(x * 0.01);
}
// ✅ Fast (precomputed)
class FastSin {
private:
static constexpr int SIZE = 360;
double table[SIZE];
public:
FastSin() {
for (int i = 0; i < SIZE; i++) {
table[i] = sin(i * 0.01);
}
}
double get(int x) {
return table[x % SIZE]; // caution: matches slow(x) only for 0 <= x < SIZE — sin(x * 0.01) has period 2π/0.01 ≈ 628, not 360
}
};
int main() {
FastSin fastSin;
auto start = chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; i++) {
slow(i);
}
auto end = chrono::high_resolution_clock::now();
cout << "slow: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
start = chrono::high_resolution_clock::now();
for (int i = 0; i < 1000000; i++) {
fastSin.get(i);
}
end = chrono::high_resolution_clock::now();
cout << "fast: " << chrono::duration_cast<chrono::milliseconds>(end - start).count() << "ms" << endl;
}
Note: repetitive math can often be replaced with a lookup table (mind accuracy and memory).
Example 3: SIMD optimization
#include <immintrin.h> // AVX
#include <iostream>
using namespace std;
// ❌ Scalar loop
void add_scalar(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
// ✅ SIMD (process 8 floats at a time; assumes n multiple of 8 in this sketch)
void add_simd(float* a, float* b, float* c, int n) {
for (int i = 0; i < n; i += 8) {
__m256 va = _mm256_loadu_ps(&a[i]);
__m256 vb = _mm256_loadu_ps(&b[i]);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_storeu_ps(&c[i], vc);
}
}
Note: SIMD processes multiple lanes in parallel; add a scalar tail loop for general n.
Profiling tools
1. gprof
# Compile
g++ -pg program.cpp -o program
# Run
./program
# Inspect profile
gprof program gmon.out > analysis.txt
2. Valgrind (Callgrind)
# Profile
valgrind --tool=callgrind ./program
# View results
kcachegrind callgrind.out.*
3. perf (Linux)
# Profile
perf record ./program
# Report
perf report
Optimization checklist
1. Algorithm
- Check time complexity (e.g. O(n²) → O(n log n))
- Remove redundant computation
- Pick the right data structure
2. Memory
- Use reserve() where size is known
- Remove unnecessary copies
- Apply move semantics
3. Compiler
- Use -O2 or -O3 in release builds
- Use inline/constexpr where appropriate
- Consider LTO
4. Cache
- Improve locality
- Prefer sequential access patterns
- Minimize struct padding where it matters
5. Parallelism
- Consider multithreading
- Use SIMD where applicable
- GPU (CUDA, OpenCL) when the problem fits
Common mistakes
Mistake 1: Premature optimization
// ❌ Hard to read “clever” code
int x = (a << 1) + (b >> 2);
// ✅ Clear code (the compiler does such strength reduction itself; b >> 2 even differs from b / 4 for negative b)
int x = a * 2 + b / 4;
Mistake 2: Optimizing without profiling
1. Find bottlenecks with a profiler
2. Optimize only those hotspots
3. Confirm with another profile run
Mistake 3: Micro-optimizing before the big wins
Algorithm improvements > data structure choice > line-level tweaks
FAQ
Q1: When should I optimize?
A
- Confirm hotspots with profiling
- Verify the hotspot matters for your SLO/users
- Measure again after changes
Q2: What optimization pays off most?
A: Improving algorithms—asymptotics dominate (e.g. O(n²) → O(n log n)).
Q3: Can I trust the compiler?
A: Yes for most local optimizations; still measure hot paths.
Q4: Performance vs readability?
A: Prefer readability; optimize proven bottlenecks.
Q5: Recommended profiling tools?
A
- Linux: perf, Valgrind
- Windows: Visual Studio Profiler
- Cross-platform: Tracy Profiler
Q6: Learning resources?
A
- Optimized C++ by Kurt Guntheroth
- CppCon talks
- Compiler Explorer