Cache-Friendly C++: Data-Oriented Design Guide

Key takeaways

Hardware-aware layout: SoA vs AoS, 64-byte cache lines, false sharing, and profiling with perf.

Introduction: cache decides throughput

Modern CPUs are often memory bound. Data-oriented design (DoD) lays out data for sequential access and SIMD. Structure-of-arrays (SoA) often beats array-of-structures (AoS) when loops touch few fields of many objects. False sharing kills parallel scaling unless you pad or align per-thread counters to separate cache lines (~64 bytes).

This article covers: DoD, cache lines, alignas, false sharing, scenarios, AoS→SoA examples, pitfalls, benchmarks, engine/simulation patterns.


Table of contents

  1. Why cache optimization matters
  2. Data-oriented design
  3. Cache lines & alignment
  4. False sharing & padding
  5. Complete examples
  6. Common mistakes
  7. Benchmarks
  8. Production patterns
  9. Summary

1. Why cache optimization matters

  • 100k entities: an AoS position-only update still drags velocity/color/id through the cache → wasted bandwidth.
  • More threads, slower code: false sharing on adjacent counters wrecks scaling.
  • Loops won’t vectorize: AoS scatters x values across large strides, defeating SIMD.

2. Data-oriented design

```mermaid
flowchart TB
    subgraph AoS["AoS"]
        E1["Entity0: pos, vel, id"]
        E2["Entity1: ..."]
    end
    subgraph SoA["SoA"]
        X["x[]"]
        Y["y[]"]
        Z["z[]"]
    end
    AoS -->|"position-only loop"| Waste["Loads unused fields"]
    SoA -->|"position-only loop"| Hit["Sequential x,y,z"]
```

Rule of thumb: thousands+ entities, field-specific hot loops, SIMD → SoA. Small counts (<~100–1000) may favor simpler AoS.
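To make the rule of thumb concrete, here is a minimal sketch of the two layouts and a field-specific hot loop; the names (EntityAoS, EntitiesSoA, update_positions) are illustrative, not from the series:

```cpp
#include <cstddef>
#include <vector>

// AoS: a position-only update still pulls vx/vy/vz/id through the cache.
struct EntityAoS { float x, y, z; float vx, vy, vz; int id; };

// SoA: the same loop touches only the contiguous x/y/z/vx/vy/vz arrays.
struct EntitiesSoA {
    std::vector<float> x, y, z, vx, vy, vz;
    std::vector<int> id;
    explicit EntitiesSoA(std::size_t n)
        : x(n), y(n), z(n), vx(n), vy(n), vz(n), id(n) {}
    std::size_t size() const { return x.size(); }
};

// Hot loop: sequential loads over contiguous arrays, trivially
// auto-vectorizable by the compiler.
void update_positions(EntitiesSoA& e, float dt) {
    for (std::size_t i = 0; i < e.size(); ++i) {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
        e.z[i] += e.vz[i] * dt;
    }
}
```

With SoA, the id array is never loaded during a position update; with AoS, every 4-byte id rides along in the same cache line as the floats.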


3. Cache lines & alignment

Cache lines are typically 64 bytes. Use alignas(64) to give hot atomics/counters their own line. Prefer std::hardware_destructive_interference_size (C++17, in <new>) over a hard-coded 64 when your standard library provides it.
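A minimal sketch of line-sized alignment, falling back to 64 bytes where the C++17 constant is unavailable (kLine and PaddedCounter are illustrative names):

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // std::hardware_destructive_interference_size (C++17)

// Fall back to the common 64-byte line when the library lacks the constant.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kLine = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kLine = 64;
#endif

// alignas pads the struct out to a full line, so adjacent counters in an
// array never share a line.
struct alignas(kLine) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(alignof(PaddedCounter) == kLine, "line-aligned");
static_assert(sizeof(PaddedCounter) % kLine == 0, "padded to whole lines");
```

The static_asserts document the invariant: two PaddedCounter objects placed back to back land on different cache lines.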


4. False sharing & padding

Independent variables on the same cache line invalidate each other across cores. Fix with line-sized padding or per-thread shards.
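A minimal sketch of the padded-counter fix for parallel increments; Slot and count_parallel are illustrative names, and 64 bytes is the assumed line size:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kLineSize = 64;  // assumed typical cache-line size

// Bad: std::atomic<long> counters[N]; -- adjacent counters share a line,
// so every increment on one core invalidates the others' cached copy.

// Good: one line per counter.
struct alignas(kLineSize) Slot { std::atomic<long> n{0}; };

long count_parallel(std::size_t threads, long iters) {
    std::vector<Slot> slots(threads);   // C++17 aligned new honors alignas
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < threads; ++t)
        pool.emplace_back([&slots, t, iters] {
            for (long i = 0; i < iters; ++i)
                slots[t].n.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();
    long total = 0;
    for (auto& s : slots) total += s.n.load();
    return total;
}
```

Each thread writes only its own line; the single merge at the end is the only cross-thread traffic.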


5. Complete examples

This series also walks through a full particle AoS vs SoA benchmark and padded atomic counters for parallel increments; adapt the code and comments to your codebase.
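The series' full benchmark is not reproduced here; a compressed sketch of its shape, assuming illustrative names (ParticleAoS, ParticlesSoA, step_aos/step_soa, time_ms):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct ParticleAoS { float x, y, z, vx, vy, vz, mass, life; };

struct ParticlesSoA {
    std::vector<float> x, vx;  // only the fields the hot loop needs
    explicit ParticlesSoA(std::size_t n) : x(n, 0.0f), vx(n, 1.0f) {}
};

// Identical arithmetic in both layouts; only the memory traffic differs.
void step_aos(std::vector<ParticleAoS>& ps, float dt) {
    for (auto& p : ps) p.x += p.vx * dt;   // 32-byte stride per particle
}
void step_soa(ParticlesSoA& ps, float dt) {
    for (std::size_t i = 0; i < ps.x.size(); ++i)
        ps.x[i] += ps.vx[i] * dt;          // dense, sequential floats
}

// Tiny timing helper; call each step in a loop and compare wall time.
template <class F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

At large particle counts the SoA step reads roughly a quarter of the bytes the AoS step does for this loop, which is where the speedup comes from.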


6. Common mistakes

  • SoA index mismatch after partial deletes: use swap-with-last across all arrays.
  • Over-padding everything: only hot, frequently written fields need line isolation.
  • SoA with random indices loses locality: sort/pack active entities.
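The first mistake above is worth a sketch: swap-with-last deletion must touch every parallel array, or indices silently desynchronize (SoA and remove are illustrative names):

```cpp
#include <cstddef>
#include <vector>

struct SoA {
    std::vector<float> x, y;
    std::vector<int> id;

    // Remove element i in O(1): overwrite slot i with the LAST element
    // in EVERY array, then pop each array. Forgetting even one array
    // leaves slot i holding stale data for that field.
    void remove(std::size_t i) {
        const std::size_t last = x.size() - 1;
        x[i] = x[last];   x.pop_back();
        y[i] = y[last];   y.pop_back();
        id[i] = id[last]; id.pop_back();
        // Note for callers: the entity formerly at `last` now lives at `i`,
        // so any external index pointing at `last` must be remapped.
    }
};
```

This keeps the arrays dense (no holes), which preserves the sequential-access property that made SoA fast in the first place.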

7. Benchmarks

Measure with perf stat -e cache-misses,cache-references, and always benchmark Release (-O3 -march=native) builds; Debug builds hide layout effects behind unoptimized loads.
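A typical invocation, assuming a hypothetical bench.cpp containing the hot loop (the miss ratio is cache-misses divided by cache-references):

```shell
# Release build; -march=native enables the host CPU's full SIMD ISA.
g++ -O3 -march=native -DNDEBUG bench.cpp -o bench

# Hardware counters for the run; compare AoS vs SoA variants.
perf stat -e cache-misses,cache-references,cycles,instructions ./bench
```

Run each variant several times and compare the miss ratio, not just wall time; frequency scaling can mask layout differences.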


8. Production patterns

ECS-style component arrays, batch processing, blocked matrix multiply, hybrid AoSoA blocks.
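A minimal sketch of the hybrid AoSoA idea: fixed-width blocks keep each field contiguous across SIMD lanes while keeping related fields of the same entities nearby (kW, Block, Entities, integrate are illustrative names):

```cpp
#include <cstddef>
#include <vector>

// 8 lanes per block, e.g. one 256-bit AVX register of floats.
constexpr std::size_t kW = 8;

// Within a block: SoA (x[0..7] contiguous, SIMD-friendly).
// Across blocks: AoS (all fields of 8 entities stay close together).
struct Block {
    float x[kW], y[kW], vx[kW], vy[kW];
};

struct Entities {
    std::vector<Block> blocks;
    std::size_t count = 0;

    void push(float x, float y, float vx, float vy) {
        if (count % kW == 0) blocks.emplace_back();  // zero-initialized
        Block& b = blocks.back();
        const std::size_t lane = count % kW;
        b.x[lane] = x; b.y[lane] = y; b.vx[lane] = vx; b.vy[lane] = vy;
        ++count;
    }
};

void integrate(Entities& e, float dt) {
    for (Block& b : e.blocks)
        for (std::size_t l = 0; l < kW; ++l) {  // fixed-trip inner loop vectorizes
            b.x[l] += b.vx[l] * dt;
            b.y[l] += b.vy[l] * dt;
        }
}
```

Unused tail lanes in the last block are zero-initialized, so integrating them is harmless; that trade (a little wasted work, no branch in the inner loop) is typical of AoSoA.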


9. Summary

| Topic         | Takeaway                                  |
|---------------|-------------------------------------------|
| DoD           | Prefer SoA when loops are field-specific  |
| Cache line    | 64 B; alignas to avoid false sharing      |
| False sharing | Pad or shard counters                     |
| Production    | Measure with perf; profile hot loops      |

Keywords

data-oriented design, cache optimization, AoS SoA, false sharing, cache line, SIMD

Next: Custom allocators & pmr (#39-2)
Previous: PIMPL & ABI (#38-3)