C++ Cache Optimization: Locality, False Sharing, SoA vs AoS

Key takeaway

Practical C++ cache optimization from principles to patterns.

What is cache optimization?

Improving how well your code uses the CPU cache hierarchy.

// ❌ Many cache misses (column-major stride in row storage)
for (int j = 0; j < cols; ++j) {
    for (int i = 0; i < rows; ++i) {
        sum += matrix[i][j];
    }
}

// ✅ Better spatial locality (row-major scan)
for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
        sum += matrix[i][j];
    }
}

Spatial locality

std::vector<int> v(1000);

// ✅ Sequential access walks one cache line after another
for (std::size_t i = 0; i < v.size(); ++i) {
    process(v[i]);
}

// ❌ Random access defeats the hardware prefetcher
for (std::size_t i = 0; i < v.size(); ++i) {
    process(v[rand() % v.size()]);
}

Examples

Example 1: Matrix multiply

// ❌ Cache-unfriendly (strided access into B)
void matmul_slow(int** A, int** B, int** C, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

// ✅ Better: scan B's columns as rows of a transposed copy
void matmul_fast(int** A, int** B, int** C, int n) {
    int** BT = transpose(B, n);  // helper (not shown): allocates and fills the transpose of B
    
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                C[i][j] += A[i][k] * BT[j][k];
            }
        }
    }
}

Example 2: Struct layout

// ❌ Padding / cache-line waste
struct Bad {
    char a;
    int b;
    char c;
    long d;
};

// ✅ Tighter packing (example; align to your ABI needs)
struct Good {
    long d;
    int b;
    char a;
    char c;
};

Example 3: Prefetching

#include <xmmintrin.h>

void processWithPrefetch(int* data, int n) {
    constexpr int prefetchDistance = 64;  // elements ahead (256 bytes for int)

    for (int i = 0; i < n; ++i) {
        if (i + prefetchDistance < n) {
            // _mm_prefetch takes char const*; _MM_HINT_T0 targets all cache levels
            _mm_prefetch(reinterpret_cast<const char*>(&data[i + prefetchDistance]), _MM_HINT_T0);
        }
        process(data[i]);
    }
}

Example 4: Blocked (tiled) multiply

#include <algorithm>  // std::min

void matmul_blocked(int** A, int** B, int** C, int n, int blockSize) {
    for (int i = 0; i < n; i += blockSize) {
        for (int j = 0; j < n; j += blockSize) {
            for (int k = 0; k < n; k += blockSize) {
                for (int ii = i; ii < std::min(i + blockSize, n); ++ii) {
                    for (int jj = j; jj < std::min(j + blockSize, n); ++jj) {
                        for (int kk = k; kk < std::min(k + blockSize, n); ++kk) {
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
                        }
                    }
                }
            }
        }
    }
}
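A good blockSize keeps the three active tiles resident in L1. A rough sizing sketch (pickBlockSize is a hypothetical helper; the 32 KiB L1 figure is an assumption, check your target):

```cpp
#include <cstddef>

// Hypothetical helper: largest power-of-two tile size such that three
// blockSize x blockSize int tiles (A, B, C) fit in the given cache budget.
int pickBlockSize(std::size_t l1Bytes = 32 * 1024) {  // 32 KiB L1 assumed
    std::size_t b = 1;
    while (3 * (b * 2) * (b * 2) * sizeof(int) <= l1Bytes) b *= 2;
    return static_cast<int>(b);
}
```

With a 32 KiB L1 this yields 32; benchmark a few sizes around that value rather than trusting the formula.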

Cache lines

constexpr size_t CACHE_LINE_SIZE = 64;  // typical on x86-64 and many ARM cores; verify per target

alignas(CACHE_LINE_SIZE) int data[16];

struct alignas(CACHE_LINE_SIZE) Counter {
    std::atomic<int> value{0};
};
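Hard-coding 64 is common, but since C++17 the <new> header can report the implementation's own hint. A small sketch with a fallback (the reported value is implementation-defined and guarded by a feature-test macro):

```cpp
#include <cstddef>
#include <new>

// Cache-line size hint for avoiding false sharing; falls back to 64 bytes
// when the standard library doesn't expose the constant.
std::size_t cacheLineHint() {
#ifdef __cpp_lib_hardware_interference_size
    return std::hardware_destructive_interference_size;
#else
    return 64;  // common on x86-64 and many ARM cores
#endif
}
```

When available, the constant is constexpr, so it can also feed alignas directly.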

Common pitfalls

Pitfall 1: False sharing

// ❌ Adjacent atomics likely share one cache line, so independent writers contend
std::atomic<int> counters[4];

// ✅ pad to separate cache lines
struct alignas(64) PaddedCounter {
    std::atomic<int> value{0};
};
PaddedCounter counters[4];
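A minimal sketch of how the padded counters might be exercised from several threads (sumPadded is an illustrative name; the iteration count is arbitrary):

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct alignas(64) PaddedCounter {
    std::atomic<int> value{0};
};

// Four threads each bump their own counter; padding puts each counter on
// its own cache line, so the writes don't invalidate one another's lines.
int sumPadded(int iters) {
    PaddedCounter counters[4];
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&counters, t, iters] {
            for (int i = 0; i < iters; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    int total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```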

Pitfall 2: Large stride

// ❌ A 1000-int stride (4 KB) touches a new cache line, and often a new page, every iteration
for (int i = 0; i < n; i += 1000) {
    process(data[i]);
}

Pitfall 3: AoS vs SoA

// ❌ AoS: may scatter hot fields
struct Particle {
    float x, y, z;
    float vx, vy, vz;
};
std::vector<Particle> particles;

// ✅ SoA: sequential access per field
struct Particles {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
};
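With SoA, a per-field update loop streams each vector front to back. A sketch reusing the Particles layout above (step and dt are illustrative names):

```cpp
#include <cstddef>
#include <vector>

struct Particles {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
};

// Position integration: each field is read sequentially, which is both
// cache-friendly and easy for the compiler to auto-vectorize.
void step(Particles& p, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) p.x[i] += p.vx[i] * dt;
    for (std::size_t i = 0; i < p.y.size(); ++i) p.y[i] += p.vy[i] * dt;
    for (std::size_t i = 0; i < p.z.size(); ++i) p.z[i] += p.vz[i] * dt;
}
```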

Pitfall 4: Alignment for SIMD

#include <cstdlib>  // std::aligned_alloc: the size must be a multiple of the alignment
float* alignedData = static_cast<float*>(std::aligned_alloc(32, 128 * sizeof(float)));
__m256 v = _mm256_load_ps(alignedData);  // aligned load requires 32-byte alignment
std::free(alignedData);

FAQ

Q1: Cache optimization?

A: Making memory access patterns friendlier to CPU caches.

Q2: Spatial locality?

A: Accessing nearby addresses in sequence.

Q3: False sharing?

A: Independent variables on the same cache line; use padding or per-line layout.

Q4: AoS vs SoA?

A: AoS is array of structs; SoA is struct of arrays—often better for SIMD and sequential field access.

Q5: Prefetching?

A: Hint loads ahead of use to hide latency.

Q6: Resources?

A: Computer Architecture, Optimized C++, What Every Programmer Should Know About Memory.


See also

  • C++ memory pool
  • C++ stack allocator
  • C++ exception performance

Practical tips

Debugging

  • Fix warnings first
  • Reproduce with small tests

Performance

  • Profile before optimizing
  • Set measurable goals

Code review

  • Follow conventions

Checklist

Before coding

  • Right technique?
  • Maintainable?
  • Meets performance goals?

While coding

  • Warnings cleared?
  • Edge cases?
  • Errors handled?

At review

  • Intent clear?
  • Tests enough?
  • Docs OK?

Keywords

C++, cache, optimization, performance, memory, false sharing, locality.

