Cache-Friendly C++: Data-Oriented Design Guide
Key takeaways
Hardware-aware layout: SoA vs AoS, 64-byte cache lines, false sharing, and profiling with perf.
Introduction: cache decides throughput
Modern CPUs are often memory bound. Data-oriented design (DoD) lays out data for sequential access and SIMD. Structure-of-arrays (SoA) often beats array-of-structures (AoS) when loops touch few fields of many objects. False sharing kills parallel scaling unless you pad or align per-thread counters to separate cache lines (~64 bytes).
This article covers: DoD, cache lines, alignas, false sharing, scenarios, AoS→SoA examples, pitfalls, benchmarks, engine/simulation patterns.
Table of contents
- Why cache optimization matters
- Data-oriented design
- Cache lines & alignment
- False sharing & padding
- Complete examples
- Common mistakes
- Benchmarks
- Production patterns
- Summary
1. Why cache optimization matters
- 100k entities: updating position only still loads velocity/color/id in AoS → wasted bandwidth.
- More threads slower: false sharing on adjacent counters.
- SIMD won’t vectorize: AoS scatters fields across strides.
2. Data-oriented design
```mermaid
flowchart TB
subgraph AoS["AoS"]
E1["Entity0: pos, vel, id"]
E2["Entity1: ..."]
end
subgraph SoA["SoA"]
X["x[]"]
Y["y[]"]
Z["z[]"]
end
AoS -->|"position-only loop"| Waste["Loads unused fields"]
SoA -->|"position-only loop"| Hit["Sequential x,y,z"]
```
Rule of thumb: with thousands of entities, field-specific hot loops, or SIMD in play, prefer SoA. Small collections (roughly under 100–1,000 elements) often favor the simpler AoS.
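As a minimal sketch of the diagram above (type and function names here are illustrative, not from a real codebase): in AoS a position-only loop drags velocity and id through the cache; in SoA each field is its own contiguous array, so the hot loop streams only what it reads.

```cpp
#include <cstddef>
#include <vector>

// AoS: one struct per entity. A position-only loop still pulls
// vx/vy/vz and id into the cache alongside x/y/z.
struct EntityAoS {
    float x, y, z;      // position (used by the hot loop)
    float vx, vy, vz;   // velocity (loaded but unused)
    int   id;           // loaded but unused
};

// SoA: one array per field. A position-only loop touches only
// the x/y/z (and vx/vy/vz) arrays, stride-1.
struct EntitiesSoA {
    std::vector<float> x, y, z;
    std::vector<float> vx, vy, vz;
    std::vector<int>   id;
};

// Contiguous, unit-stride access the compiler can auto-vectorize.
void integrate(EntitiesSoA& e, float dt) {
    for (std::size_t i = 0; i < e.x.size(); ++i) {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
        e.z[i] += e.vz[i] * dt;
    }
}
```

The trade-off: SoA code is more verbose (every structural change touches every array), which is why the small-count cases above can stay AoS.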
3. Cache lines & alignment
Cache lines are typically 64 bytes. Place hot atomics and counters on separate lines with alignas(64), and prefer std::hardware_destructive_interference_size (C++17) over a hard-coded 64 when your standard library provides it.
4. False sharing
Logically independent variables that happen to share a cache line invalidate each other's copies across cores on every write, serializing threads that never touch the same data. Fix it with line-sized padding or per-thread shards that are reduced once at the end.
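A minimal sketch of the per-thread-shard fix (Shard and count_parallel are illustrative names): each thread increments its own line-aligned counter, and the totals are summed once after the workers join.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// One counter per thread, each padded to its own 64-byte line, so
// concurrent increments never contend on a shared cache line.
struct alignas(64) Shard {
    std::atomic<std::uint64_t> count{0};
};

std::uint64_t count_parallel(unsigned threads, std::uint64_t per_thread) {
    std::vector<Shard> shards(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&shards, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i)
                shards[t].count.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();

    std::uint64_t total = 0;
    for (auto& s : shards) total += s.count.load();  // reduce once
    return total;
}
```

With a plain `std::atomic` array and no alignas, the same loop scales badly: adjacent counters land on one line and every increment ping-pongs it between cores.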
5. Complete examples
This series also walks through a full particle AoS vs SoA benchmark and padded atomic counters for parallel increments—adapt the code and comments to your codebase.
6. Common mistakes
- SoA index mismatch after partial deletes—use swap-with-last across all arrays.
- Over-padding everything—only hot written fields need isolation.
- SoA with random indices loses locality—sort/pack active entities.
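The first mistake above is worth spelling out. A sketch of swap-with-last removal over a hypothetical SoA Particles store: every field array must move the same last element into the freed slot, or indices silently desynchronize.

```cpp
#include <cstddef>
#include <vector>

// Minimal SoA store. remove() swaps the last element into slot i in
// EVERY array, then shrinks them all, so index i stays one entity.
struct Particles {
    std::vector<float> x, vx;
    std::vector<int>   id;

    void remove(std::size_t i) {
        std::size_t last = x.size() - 1;
        x[i]  = x[last];   x.pop_back();
        vx[i] = vx[last];  vx.pop_back();
        id[i] = id[last];  id.pop_back();  // skip any array and it desyncs
    }
};
```

Note the trade-off: removal is O(1) but reorders elements, so it only suits stores where entity order does not matter.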
7. Benchmarks
Use perf stat -e cache-misses,cache-references and Release (-O3 -march=native) builds.
8. Production patterns
ECS-style component arrays, batch processing, blocked matrix multiply, hybrid AoSoA blocks.
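A sketch of the hybrid AoSoA idea (ParticleBlock, kBlock, and step are illustrative; the block size would normally match your SIMD width): fields are small contiguous arrays inside fixed-size blocks, so inner loops stay stride-1 while related fields stay close together.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlock = 8;  // e.g. one AVX register of floats

// AoSoA: an array of blocks, each block a small SoA.
struct ParticleBlock {
    std::array<float, kBlock> x, y;
    std::array<float, kBlock> vx, vy;
};

// Outer loop walks blocks; the inner loop is unit-stride per field,
// which is SIMD-friendly without scattering fields across the heap.
void step(std::vector<ParticleBlock>& blocks, float dt) {
    for (auto& b : blocks)
        for (std::size_t i = 0; i < kBlock; ++i) {
            b.x[i] += b.vx[i] * dt;
            b.y[i] += b.vy[i] * dt;
        }
}
```

Compared with pure SoA, AoSoA keeps a particle's fields within a few cache lines of each other, which helps passes that read several fields at once.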
9. Summary
| Topic | Takeaway |
|---|---|
| DoD | Prefer SoA when loops are field-specific |
| Cache line | 64 bytes; alignas to avoid false sharing |
| False sharing | pad or shard counters |
| Production | measure with perf, profile hot loops |
Keywords
data-oriented design, cache optimization, AoS SoA, false sharing, cache line, SIMD
Next: Custom allocators & pmr (#39-2)
Previous: PIMPL & ABI (#38-3)