C++ Cache Optimization: Locality, False Sharing, SoA vs AoS

Key takeaway

Practical C++ cache optimization from principles to patterns.

What is cache optimization?

Improving how well your code uses the CPU cache hierarchy.

// ❌ Many cache misses (column-major stride in row storage)
for (int j = 0; j < cols; ++j) {
    for (int i = 0; i < rows; ++i) {
        sum += matrix[i][j];
    }
}

// ✅ Better spatial locality (row-major scan)
for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
        sum += matrix[i][j];
    }
}

Spatial locality

std::vector<int> v(1000);

// ✅ Sequential access walks one cache line after another
for (std::size_t i = 0; i < v.size(); ++i) {
    process(v[i]);
}

// ❌ Random access defeats the hardware prefetcher
for (std::size_t i = 0; i < v.size(); ++i) {
    process(v[rand() % v.size()]);
}

Examples

Example 1: Matrix multiply

// ❌ Cache-unfriendly (strided access into B)
void matmul_slow(int** A, int** B, int** C, int n) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

// ✅ Better: scan B's columns as rows of a transposed copy
void matmul_fast(int** A, int** B, int** C, int n) {
    int** BT = transpose(B, n);  // helper (not shown): allocates and fills the transpose of B
    
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            for (int k = 0; k < n; ++k) {
                C[i][j] += A[i][k] * BT[j][k];
            }
        }
    }
}

Example 2: Struct layout

// ❌ Padding / cache-line waste
struct Bad {
    char a;
    int b;
    char c;
    long d;
};

// ✅ Tighter packing (example; align to your ABI needs)
struct Good {
    long d;
    int b;
    char a;
    char c;
};

Example 3: Prefetching

#include <xmmintrin.h>

void processWithPrefetch(int* data, int n) {
    constexpr int prefetchDistance = 64;  // elements ahead (256 bytes for int)

    for (int i = 0; i < n; ++i) {
        if (i + prefetchDistance < n) {
            // _mm_prefetch takes char const*; _MM_HINT_T0 targets all cache levels
            _mm_prefetch(reinterpret_cast<const char*>(&data[i + prefetchDistance]), _MM_HINT_T0);
        }
        process(data[i]);
    }
}

Example 4: Blocked (tiled) multiply

#include <algorithm>  // std::min

void matmul_blocked(int** A, int** B, int** C, int n, int blockSize) {
    for (int i = 0; i < n; i += blockSize) {
        for (int j = 0; j < n; j += blockSize) {
            for (int k = 0; k < n; k += blockSize) {
                for (int ii = i; ii < std::min(i + blockSize, n); ++ii) {
                    for (int jj = j; jj < std::min(j + blockSize, n); ++jj) {
                        for (int kk = k; kk < std::min(k + blockSize, n); ++kk) {
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
                        }
                    }
                }
            }
        }
    }
}
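A good blockSize keeps the three active tiles resident in L1. A rough sizing sketch (pickBlockSize is a hypothetical helper; the 32 KiB L1 figure is an assumption, check your target):

```cpp
#include <cstddef>

// Hypothetical helper: largest power-of-two tile size such that three
// blockSize x blockSize int tiles (A, B, C) fit in the given cache budget.
int pickBlockSize(std::size_t l1Bytes = 32 * 1024) {  // 32 KiB L1 assumed
    std::size_t b = 1;
    while (3 * (b * 2) * (b * 2) * sizeof(int) <= l1Bytes) b *= 2;
    return static_cast<int>(b);
}
```

With a 32 KiB L1 this yields 32; benchmark a few sizes around that value rather than trusting the formula.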

Cache lines

constexpr size_t CACHE_LINE_SIZE = 64;  // typical on x86-64 and many ARM cores; verify per target

alignas(CACHE_LINE_SIZE) int data[16];

struct alignas(CACHE_LINE_SIZE) Counter {
    std::atomic<int> value{0};
};
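Hard-coding 64 is common, but since C++17 the <new> header can report the implementation's own hint. A small sketch with a fallback (the reported value is implementation-defined and guarded by a feature-test macro):

```cpp
#include <cstddef>
#include <new>

// Cache-line size hint for avoiding false sharing; falls back to 64 bytes
// when the standard library doesn't expose the constant.
std::size_t cacheLineHint() {
#ifdef __cpp_lib_hardware_interference_size
    return std::hardware_destructive_interference_size;
#else
    return 64;  // common on x86-64 and many ARM cores
#endif
}
```

When available, the constant is constexpr, so it can also feed alignas directly.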

Common pitfalls

Pitfall 1: False sharing

// ❌ Adjacent atomics likely share one cache line, so independent writers contend
std::atomic<int> counters[4];

// ✅ pad to separate cache lines
struct alignas(64) PaddedCounter {
    std::atomic<int> value{0};
};
PaddedCounter counters[4];
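A minimal sketch of how the padded counters might be exercised from several threads (sumPadded is an illustrative name; the iteration count is arbitrary):

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct alignas(64) PaddedCounter {
    std::atomic<int> value{0};
};

// Four threads each bump their own counter; padding puts each counter on
// its own cache line, so the writes don't invalidate one another's lines.
int sumPadded(int iters) {
    PaddedCounter counters[4];
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&counters, t, iters] {
            for (int i = 0; i < iters; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& w : workers) w.join();
    int total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```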

Pitfall 2: Large stride

// ❌ A 1000-int stride (4 KB) touches a new cache line, and often a new page, every iteration
for (int i = 0; i < n; i += 1000) {
    process(data[i]);
}

Pitfall 3: AoS vs SoA

// ❌ AoS: may scatter hot fields
struct Particle {
    float x, y, z;
    float vx, vy, vz;
};
std::vector<Particle> particles;

// ✅ SoA: sequential access per field
struct Particles {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
};
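With SoA, a per-field update loop streams each vector front to back. A sketch reusing the Particles layout above (step and dt are illustrative names):

```cpp
#include <cstddef>
#include <vector>

struct Particles {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
};

// Position integration: each field is read sequentially, which is both
// cache-friendly and easy for the compiler to auto-vectorize.
void step(Particles& p, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) p.x[i] += p.vx[i] * dt;
    for (std::size_t i = 0; i < p.y.size(); ++i) p.y[i] += p.vy[i] * dt;
    for (std::size_t i = 0; i < p.z.size(); ++i) p.z[i] += p.vz[i] * dt;
}
```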

Pitfall 4: Alignment for SIMD

#include <cstdlib>  // std::aligned_alloc: the size must be a multiple of the alignment
float* alignedData = static_cast<float*>(std::aligned_alloc(32, 128 * sizeof(float)));
__m256 v = _mm256_load_ps(alignedData);  // aligned load requires 32-byte alignment
std::free(alignedData);

FAQ

Q1: Cache optimization?

A: Making memory access patterns friendlier to CPU caches.

Q2: Spatial locality?

A: Accessing nearby addresses in sequence.

Q3: False sharing?

A: Independent variables on the same cache line; use padding or per-line layout.

Q4: AoS vs SoA?

A: AoS is array of structs; SoA is struct of arrays—often better for SIMD and sequential field access.

Q5: Prefetching?

A: Hint loads ahead of use to hide latency.

Q6: Resources?

A: Computer Architecture, Optimized C++, What Every Programmer Should Know About Memory.


See also

  • C++ memory pool
  • C++ stack allocator
  • C++ exception performance

Practical tips

Debugging

  • Fix warnings first
  • Reproduce with small tests

Performance

  • Profile before optimizing
  • Set measurable goals

Code review

  • Follow conventions

Checklist

Before coding

  • Right technique?
  • Maintainable?
  • Meets performance goals?

While coding

  • Warnings cleared?
  • Edge cases?
  • Errors handled?

At review

  • Intent clear?
  • Tests enough?
  • Docs OK?

Keywords

C++, cache, optimization, performance, memory, false sharing, locality.

