C++ Cache Optimization: Locality, False Sharing, SoA vs AoS
이 글의 핵심
Practical C++ cache optimization from principles to patterns.
What is cache optimization?
Improving how well your code uses the CPU cache hierarchy.
// ❌ Many cache misses (column-major stride in row storage)
for (int j = 0; j < cols; ++j) {
for (int i = 0; i < rows; ++i) {
sum += matrix[i][j];
}
}
// ✅ Better spatial locality (row-major scan)
for (int i = 0; i < rows; ++i) {
for (int j = 0; j < cols; ++j) {
sum += matrix[i][j];
}
}
Spatial locality
std::vector<int> v(1000);
// ✅ Sequential access
for (int i = 0; i < v.size(); ++i) {
process(v[i]);
}
// ❌ Random access pattern
for (int i = 0; i < v.size(); ++i) {
process(v[rand() % v.size()]);
}
Examples
Example 1: Matrix multiply
// ❌ Cache-unfriendly (strided access into B)
void matmul_slow(int** A, int** B, int** C, int n) {
for (int i = 0; i < n; ++i) {
for (int j = 0; j < n; ++j) {
for (int k = 0; k < n; ++k) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
}
// ✅ Better with transposed B for sequential reads
void matmul_fast(int** A, int** B, int** C, int n) {
int** BT = transpose(B, n);
for (int i = 0; i < n; ++i) {
for (int j = 0; j < n; ++j) {
for (int k = 0; k < n; ++k) {
C[i][j] += A[i][k] * BT[j][k];
}
}
}
}
Example 2: Struct layout
// ❌ Padding / cache-line waste
struct Bad {
char a;
int b;
char c;
long d;
};
// ✅ Tighter packing (example; align to your ABI needs)
struct Good {
long d;
int b;
char a;
char c;
};
Example 3: Prefetching
#include <xmmintrin.h>
void processWithPrefetch(int* data, int n) {
constexpr int prefetchDistance = 64;
for (int i = 0; i < n; ++i) {
if (i + prefetchDistance < n) {
_mm_prefetch(&data[i + prefetchDistance], _MM_HINT_T0);
}
process(data[i]);
}
}
Example 4: Blocked (tiled) multiply
void matmul_blocked(int** A, int** B, int** C, int n, int blockSize) {
for (int i = 0; i < n; i += blockSize) {
for (int j = 0; j < n; j += blockSize) {
for (int k = 0; k < n; k += blockSize) {
for (int ii = i; ii < std::min(i + blockSize, n); ++ii) {
for (int jj = j; jj < std::min(j + blockSize, n); ++jj) {
for (int kk = k; kk < std::min(k + blockSize, n); ++kk) {
C[ii][jj] += A[ii][kk] * B[kk][jj];
}
}
}
}
}
}
}
Cache lines
constexpr size_t CACHE_LINE_SIZE = 64;
alignas(CACHE_LINE_SIZE) int data[16];
struct alignas(CACHE_LINE_SIZE) Counter {
std::atomic<int> value{0};
};
Common pitfalls
Pitfall 1: False sharing
// ❌ false sharing
std::atomic<int> counters[4];
// ✅ pad to separate cache lines
struct alignas(64) PaddedCounter {
std::atomic<int> value{0};
};
PaddedCounter counters[4];
Pitfall 2: Large stride
for (int i = 0; i < n; i += 1000) {
process(data[i]);
}
Pitfall 3: AoS vs SoA
// ❌ AoS: may scatter hot fields
struct Particle {
float x, y, z;
float vx, vy, vz;
};
std::vector<Particle> particles;
// ✅ SoA: sequential access per field
struct Particles {
std::vector<float> x, y, z;
std::vector<float> vx, vy, vz;
};
Pitfall 4: Alignment for SIMD
float* alignedData = (float*)aligned_alloc(32, 100 * sizeof(float));
__m256 v = _mm256_load_ps(alignedData);
FAQ
Q1: Cache optimization?
A: Making memory access patterns friendlier to CPU caches.
Q2: Spatial locality?
A: Accessing nearby addresses in sequence.
Q3: False sharing?
A: Independent variables on the same cache line; use padding or per-line layout.
Q4: AoS vs SoA?
A: AoS is array of structs; SoA is struct of arrays—often better for SIMD and sequential field access.
Q5: Prefetching?
A: Hint loads ahead of use to hide latency.
Q6: Resources?
A: Computer Architecture, Optimized C++, What Every Programmer Should Know About Memory.
See also
- C++ memory pool
- C++ stack allocator
- C++ exception performance
Practical tips
Debugging
- Fix warnings first
- Reproduce with small tests
Performance
- Profile before optimizing
- Set measurable goals
Code review
- Follow conventions
Checklist
Before coding
- Right technique?
- Maintainable?
- Meets performance goals?
While coding
- Warnings cleared?
- Edge cases?
- Errors handled?
At review
- Intent clear?
- Tests enough?
- Docs OK?
Keywords
C++, cache, optimization, performance, memory, false sharing, locality.
Related posts
- C++ exception performance
- C++ flyweight pattern
- C++ memory pool
- C++ profiling
- C++ stack allocator