본문으로 건너뛰기
Previous
Next
Cache-Friendly C++: Data-Oriented Design and AoS vs SoA

Cache-Friendly C++: Data-Oriented Design and AoS vs SoA

Cache-Friendly C++: Data-Oriented Design and AoS vs SoA

이 글의 핵심

CPU speed has outpaced memory bandwidth. How you lay out data in memory determines whether your hot loops run at full speed or wait for cache misses. This guide covers AoS vs SoA, cache line alignment, and false sharing with measurable examples.

Why Memory Layout Decides Performance

Modern CPUs can execute billions of instructions per second, but main memory delivers data at a much slower rate. When the CPU needs data that isn’t in cache, it stalls and waits — a cache miss typically costs 100-300 cycles.

A well-tuned algorithm with poor memory layout often runs slower than a naive algorithm with cache-friendly layout. This is the core insight of data-oriented design (DoD): design your data structures for how the CPU accesses them, not just for logical convenience.

L1 cache hit:  ~4 cycles   (data already in L1)
L2 cache hit:  ~12 cycles
L3 cache hit:  ~40 cycles
RAM:           ~200 cycles  (50x slower than L1)

Cache Lines — The Unit of Memory Transfer

The CPU doesn’t fetch individual bytes from memory. It fetches cache lines — typically 64 contiguous bytes. When you access one byte, the surrounding 63 bytes come along for free.

This matters for data layout:

  • If the next data you need is in the same cache line, it’s free
  • If it’s in a different cache line, you pay the full cache miss penalty
// Each line: 64 bytes = 16 ints
int arr[1000];

// Sequential access — cache-friendly
// After loading arr[0], arr[1]..arr[15] are already in cache
for (int i = 0; i < 1000; i++) {
    process(arr[i]);  // hits L1 ~15 out of 16 accesses
}

// Strided access — cache-unfriendly
// Jumps 64 bytes each iteration — every access may miss
for (int i = 0; i < 1000; i += 16) {
    process(arr[i]);  // potential cache miss every time
}

Array of Structures vs Structure of Arrays

The most impactful layout decision for performance-critical code.

Array of Structures (AoS)

The natural object-oriented layout:

struct Particle {
    float x, y, z;      // position (12 bytes)
    float vx, vy, vz;   // velocity (12 bytes)
    float mass;          // (4 bytes)
    int id;             // (4 bytes)
    uint32_t color;     // (4 bytes)
    // total: ~40 bytes per particle
};

std::vector<Particle> particles(100'000);

// Position-only update loop
for (auto& p : particles) {
    p.x += p.vx * dt;
    p.y += p.vy * dt;
    p.z += p.vz * dt;
    // mass, id, color are LOADED but NEVER USED — wasted bandwidth
}

With 40-byte structs, each cache line (64 bytes) holds ~1.6 particles. The position update loop accesses x, vx, y, vy, z, vz — which are all in the same struct. But it also loads mass, id, color from every cache line, wasting 28% of memory bandwidth.

Structure of Arrays (SoA)

struct Particles {
    std::vector<float> x, y, z;    // all x positions together
    std::vector<float> vx, vy, vz; // all velocities together
    std::vector<float> mass;
    std::vector<int> id;
    std::vector<uint32_t> color;
    std::size_t count;
};

Particles particles;
particles.count = 100'000;
particles.x.resize(100'000);
// ... initialize all arrays

// Position-only update loop — reads only what it needs
for (std::size_t i = 0; i < particles.count; ++i) {
    particles.x[i] += particles.vx[i] * dt;
    particles.y[i] += particles.vy[i] * dt;
    particles.z[i] += particles.vz[i] * dt;
}
// `mass`, `id`, `color` arrays are never touched — zero wasted bandwidth
// x[0]..x[15] fit in one cache line, prefetching ahead works perfectly

For SIMD (AVX2 processes 8 floats at once), SoA aligns perfectly — you can load 8 consecutive x values and 8 consecutive vx values into SIMD registers and update them all at once.

When AoS Is Better

SoA is not universally better:

  • Small counts (< ~1000 objects): cache behavior difference is negligible
  • Accessing many fields at once: if your loop touches x, y, z, mass every time, AoS keeps them together
  • Random access by object: if you frequently look up a single particle by index and use all its data, AoS has better locality for that use case
  • Code clarity: SoA code is more verbose and error-prone

Rule: profile first. If a loop touches fewer than half the struct’s fields on most iterations, SoA is likely faster.


Cache Line Alignment

Basic Alignment

#include <cstddef>
#include <new>  // for std::hardware_destructive_interference_size

// Align a struct to the cache line boundary
struct alignas(64) HotData {
    int counter;
    float accumulated;
    // ... rest of the fields
};

// C++17 portable cache line size
constexpr std::size_t CACHE_LINE = std::hardware_destructive_interference_size;

struct alignas(CACHE_LINE) ThreadLocal {
    long count;
    double sum;
};

Alignment matters most for data that multiple threads write to, or that is accessed in tight loops where cache line boundaries matter.

Checking Your Layout

#include <iostream>

struct AoS_Particle {
    float x, y, z;   // 12 bytes
    float vx, vy, vz; // 12 bytes
    int id;           // 4 bytes
};
// Total: 28 bytes — 2.28 particles per 64-byte cache line

int main() {
    std::cout << "Particle size: " << sizeof(AoS_Particle) << '\n';
    std::cout << "Particles per cache line: "
              << 64.0 / sizeof(AoS_Particle) << '\n';
    // Particles per cache line: 2.28
}

False Sharing

False sharing is the silent multi-threaded performance killer. Two threads write to different variables, but those variables share a cache line:

// PROBLEM: both counters on the same cache line
struct Counters {
    long thread0_count;  // bytes 0-7
    long thread1_count;  // bytes 8-15
    // Both in the same 64-byte cache line!
};

Counters shared;

// Thread 0 writes shared.thread0_count → invalidates the cache line on Thread 1
// Thread 1 writes shared.thread1_count → invalidates the cache line on Thread 0
// They fight over the same cache line even though they write different data

Fix: pad to separate cache lines

struct alignas(64) PaddedCounter {
    long count;
    char padding[64 - sizeof(long)];  // fill the rest of the cache line
};

// Or simpler with alignas:
struct ThreadCounters {
    alignas(64) long thread0_count;
    alignas(64) long thread1_count;
    // Each is on its own cache line — no interference
};

// Or use hardware_destructive_interference_size:
struct SafeCounters {
    alignas(std::hardware_destructive_interference_size) long thread0_count;
    alignas(std::hardware_destructive_interference_size) long thread1_count;
};

Measuring false sharing:

# Linux: watch cache line invalidations with perf
perf stat -e cache-misses,cache-references,LLC-load-misses ./my_program

# Compare with padded vs unpadded version
# False sharing shows as high cache miss rate even with small working sets

SoA Pitfalls

Index Mismatch on Delete

SoA stores parallel arrays — indices must stay consistent across all arrays:

// The swap-with-last delete — must apply to ALL arrays
void deleteParticle(Particles& p, std::size_t idx) {
    std::size_t last = p.count - 1;

    // Swap with last in EVERY array
    std::swap(p.x[idx],   p.x[last]);
    std::swap(p.y[idx],   p.y[last]);
    std::swap(p.z[idx],   p.z[last]);
    std::swap(p.vx[idx],  p.vx[last]);
    std::swap(p.vy[idx],  p.vy[last]);
    std::swap(p.vz[idx],  p.vz[last]);
    std::swap(p.mass[idx], p.mass[last]);
    std::swap(p.id[idx],   p.id[last]);

    p.count--;
    // Forget one array → silent data corruption
}

Random Access in SoA

If you frequently access all fields of a single object by index, SoA is slower — each field access may be in a different cache line:

// SoA — random access by index fetches from many cache lines
void printParticle(const Particles& p, std::size_t idx) {
    std::cout << p.x[idx] << ' '     // cache miss
              << p.y[idx] << ' '     // cache miss (different array)
              << p.z[idx] << ' '     // cache miss
              << p.mass[idx] << '\n'; // cache miss
}

For mixed access patterns, AoSoA (Array of Structures of Arrays) is sometimes the best compromise — groups of 8 or 16 objects packed together for SIMD, multiple groups forming the outer array.


Production Patterns

Entity Component System (ECS)

ECS is the mainstream game engine approach to data-oriented design:

// Each component type is stored in a contiguous array
// Systems iterate component arrays rather than game objects

struct PositionComponent { float x, y, z; };
struct VelocityComponent { float vx, vy, vz; };
struct RenderComponent { uint32_t mesh_id; uint32_t material_id; };

// Physics system touches only Position + Velocity (tight loop, great cache use)
void physicsSystem(
    std::vector<PositionComponent>& positions,
    const std::vector<VelocityComponent>& velocities,
    float dt)
{
    for (std::size_t i = 0; i < positions.size(); ++i) {
        positions[i].x += velocities[i].vx * dt;
        positions[i].y += velocities[i].vy * dt;
        positions[i].z += velocities[i].vz * dt;
    }
}

// Render system touches only Position + Render
void renderSystem(
    const std::vector<PositionComponent>& positions,
    const std::vector<RenderComponent>& renders)
{
    for (std::size_t i = 0; i < positions.size(); ++i) {
        submitDrawCall(renders[i].mesh_id, renders[i].material_id,
            positions[i].x, positions[i].y, positions[i].z);
    }
}

Key Takeaways

  • Cache misses cost ~50-200x more than L1 hits — data layout is as important as algorithm choice
  • AoS is natural for OOP; SoA wins when loops access only a few fields from large arrays
  • Cache lines are 64 bytes — pack frequently co-accessed data within 64-byte boundaries
  • False sharing occurs when two threads write different variables that share a cache line — use alignas(64) to isolate hot written data
  • SoA pitfalls: maintain index consistency across all parallel arrays, and avoid random single-object lookups
  • Measure with perf stat -e cache-misses before optimizing — false sharing and AoS/SoA problems look similar in profiles
  • ECS (Entity Component System) in game engines is the mainstream application of SoA principles at scale

자주 묻는 질문 (FAQ)

Q. 이 내용을 실무에서 언제 쓰나요?

A. Data-oriented design for C++ performance: AoS vs SoA layout, cache lines, false sharing, alignas, benchmarking with perf… 실무에서는 위 본문의 예제와 선택 가이드를 참고해 적용하면 됩니다.

Q. 선행으로 읽으면 좋은 글은?

A. 각 글 하단의 이전 글 또는 관련 글 링크를 따라가면 순서대로 배울 수 있습니다. C++ 시리즈 목차에서 전체 흐름을 확인할 수 있습니다.

Q. 더 깊이 공부하려면?

A. cppreference와 해당 라이브러리 공식 문서를 참고하세요. 글 말미의 참고 자료 링크도 활용하면 좋습니다.


같이 보면 좋은 글 (내부 링크)

이 주제와 연결되는 다른 글입니다.


이 글에서 다루는 키워드 (관련 검색어)

C++, Performance, Data-Oriented Design, Cache, Cache Line, SoA, AoS 등으로 검색하시면 이 글이 도움이 됩니다.