Cache-Friendly C++: Data-Oriented Design and AoS vs SoA
이 글의 핵심
CPU speed has outpaced memory bandwidth. How you lay out data in memory determines whether your hot loops run at full speed or wait for cache misses. This guide covers AoS vs SoA, cache line alignment, and false sharing with measurable examples.
Why Memory Layout Decides Performance
Modern CPUs can execute billions of instructions per second, but main memory delivers data at a much slower rate. When the CPU needs data that isn’t in cache, it stalls and waits — a cache miss typically costs 100-300 cycles.
A well-tuned algorithm with poor memory layout often runs slower than a naive algorithm with cache-friendly layout. This is the core insight of data-oriented design (DoD): design your data structures for how the CPU accesses them, not just for logical convenience.
L1 cache hit: ~4 cycles (data already in L1)
L2 cache hit: ~12 cycles
L3 cache hit: ~40 cycles
RAM: ~200 cycles (50x slower than L1)
Cache Lines — The Unit of Memory Transfer
The CPU doesn’t fetch individual bytes from memory. It fetches cache lines — typically 64 contiguous bytes. When you access one byte, the surrounding 63 bytes come along for free.
This matters for data layout:
- If the next data you need is in the same cache line, it’s free
- If it’s in a different cache line, you pay the full cache miss penalty
// Each line: 64 bytes = 16 ints
int arr[1000];
// Sequential access — cache-friendly
// After loading arr[0], arr[1]..arr[15] are already in cache
for (int i = 0; i < 1000; i++) {
process(arr[i]); // hits L1 ~15 out of 16 accesses
}
// Strided access — cache-unfriendly
// Jumps 64 bytes each iteration — every access may miss
for (int i = 0; i < 1000; i += 16) {
process(arr[i]); // potential cache miss every time
}
Array of Structures vs Structure of Arrays
The most impactful layout decision for performance-critical code.
Array of Structures (AoS)
The natural object-oriented layout:
struct Particle {
float x, y, z; // position (12 bytes)
float vx, vy, vz; // velocity (12 bytes)
float mass; // (4 bytes)
int id; // (4 bytes)
uint32_t color; // (4 bytes)
// total: ~40 bytes per particle
};
std::vector<Particle> particles(100'000);
// Position-only update loop
for (auto& p : particles) {
p.x += p.vx * dt;
p.y += p.vy * dt;
p.z += p.vz * dt;
// mass, id, color are LOADED but NEVER USED — wasted bandwidth
}
With 40-byte structs, each cache line (64 bytes) holds ~1.6 particles. The position update loop accesses x, vx, y, vy, z, vz — which are all in the same struct. But it also loads mass, id, color from every cache line, wasting 28% of memory bandwidth.
Structure of Arrays (SoA)
struct Particles {
std::vector<float> x, y, z; // all x positions together
std::vector<float> vx, vy, vz; // all velocities together
std::vector<float> mass;
std::vector<int> id;
std::vector<uint32_t> color;
std::size_t count;
};
Particles particles;
particles.count = 100'000;
particles.x.resize(100'000);
// ... initialize all arrays
// Position-only update loop — reads only what it needs
for (std::size_t i = 0; i < particles.count; ++i) {
particles.x[i] += particles.vx[i] * dt;
particles.y[i] += particles.vy[i] * dt;
particles.z[i] += particles.vz[i] * dt;
}
// `mass`, `id`, `color` arrays are never touched — zero wasted bandwidth
// x[0]..x[15] fit in one cache line, prefetching ahead works perfectly
For SIMD (AVX2 processes 8 floats at once), SoA aligns perfectly — you can load 8 consecutive x values and 8 consecutive vx values into SIMD registers and update them all at once.
When AoS Is Better
SoA is not universally better:
- Small counts (< ~1000 objects): cache behavior difference is negligible
- Accessing many fields at once: if your loop touches
x, y, z, massevery time, AoS keeps them together - Random access by object: if you frequently look up a single particle by index and use all its data, AoS has better locality for that use case
- Code clarity: SoA code is more verbose and error-prone
Rule: profile first. If a loop touches fewer than half the struct’s fields on most iterations, SoA is likely faster.
Cache Line Alignment
Basic Alignment
#include <cstddef>
#include <new> // for std::hardware_destructive_interference_size
// Align a struct to the cache line boundary
struct alignas(64) HotData {
int counter;
float accumulated;
// ... rest of the fields
};
// C++17 portable cache line size
constexpr std::size_t CACHE_LINE = std::hardware_destructive_interference_size;
struct alignas(CACHE_LINE) ThreadLocal {
long count;
double sum;
};
Alignment matters most for data that multiple threads write to, or that is accessed in tight loops where cache line boundaries matter.
Checking Your Layout
#include <iostream>
struct AoS_Particle {
float x, y, z; // 12 bytes
float vx, vy, vz; // 12 bytes
int id; // 4 bytes
};
// Total: 28 bytes — 2.28 particles per 64-byte cache line
int main() {
std::cout << "Particle size: " << sizeof(AoS_Particle) << '\n';
std::cout << "Particles per cache line: "
<< 64.0 / sizeof(AoS_Particle) << '\n';
// Particles per cache line: 2.28
}
False Sharing
False sharing is the silent multi-threaded performance killer. Two threads write to different variables, but those variables share a cache line:
// PROBLEM: both counters on the same cache line
struct Counters {
long thread0_count; // bytes 0-7
long thread1_count; // bytes 8-15
// Both in the same 64-byte cache line!
};
Counters shared;
// Thread 0 writes shared.thread0_count → invalidates the cache line on Thread 1
// Thread 1 writes shared.thread1_count → invalidates the cache line on Thread 0
// They fight over the same cache line even though they write different data
Fix: pad to separate cache lines
struct alignas(64) PaddedCounter {
long count;
char padding[64 - sizeof(long)]; // fill the rest of the cache line
};
// Or simpler with alignas:
struct ThreadCounters {
alignas(64) long thread0_count;
alignas(64) long thread1_count;
// Each is on its own cache line — no interference
};
// Or use hardware_destructive_interference_size:
struct SafeCounters {
alignas(std::hardware_destructive_interference_size) long thread0_count;
alignas(std::hardware_destructive_interference_size) long thread1_count;
};
Measuring false sharing:
# Linux: watch cache line invalidations with perf
perf stat -e cache-misses,cache-references,LLC-load-misses ./my_program
# Compare with padded vs unpadded version
# False sharing shows as high cache miss rate even with small working sets
SoA Pitfalls
Index Mismatch on Delete
SoA stores parallel arrays — indices must stay consistent across all arrays:
// The swap-with-last delete — must apply to ALL arrays
void deleteParticle(Particles& p, std::size_t idx) {
std::size_t last = p.count - 1;
// Swap with last in EVERY array
std::swap(p.x[idx], p.x[last]);
std::swap(p.y[idx], p.y[last]);
std::swap(p.z[idx], p.z[last]);
std::swap(p.vx[idx], p.vx[last]);
std::swap(p.vy[idx], p.vy[last]);
std::swap(p.vz[idx], p.vz[last]);
std::swap(p.mass[idx], p.mass[last]);
std::swap(p.id[idx], p.id[last]);
p.count--;
// Forget one array → silent data corruption
}
Random Access in SoA
If you frequently access all fields of a single object by index, SoA is slower — each field access may be in a different cache line:
// SoA — random access by index fetches from many cache lines
void printParticle(const Particles& p, std::size_t idx) {
std::cout << p.x[idx] << ' ' // cache miss
<< p.y[idx] << ' ' // cache miss (different array)
<< p.z[idx] << ' ' // cache miss
<< p.mass[idx] << '\n'; // cache miss
}
For mixed access patterns, AoSoA (Array of Structures of Arrays) is sometimes the best compromise — groups of 8 or 16 objects packed together for SIMD, multiple groups forming the outer array.
Production Patterns
Entity Component System (ECS)
ECS is the mainstream game engine approach to data-oriented design:
// Each component type is stored in a contiguous array
// Systems iterate component arrays rather than game objects
struct PositionComponent { float x, y, z; };
struct VelocityComponent { float vx, vy, vz; };
struct RenderComponent { uint32_t mesh_id; uint32_t material_id; };
// Physics system touches only Position + Velocity (tight loop, great cache use)
void physicsSystem(
std::vector<PositionComponent>& positions,
const std::vector<VelocityComponent>& velocities,
float dt)
{
for (std::size_t i = 0; i < positions.size(); ++i) {
positions[i].x += velocities[i].vx * dt;
positions[i].y += velocities[i].vy * dt;
positions[i].z += velocities[i].vz * dt;
}
}
// Render system touches only Position + Render
void renderSystem(
const std::vector<PositionComponent>& positions,
const std::vector<RenderComponent>& renders)
{
for (std::size_t i = 0; i < positions.size(); ++i) {
submitDrawCall(renders[i].mesh_id, renders[i].material_id,
positions[i].x, positions[i].y, positions[i].z);
}
}
Key Takeaways
- Cache misses cost ~50-200x more than L1 hits — data layout is as important as algorithm choice
- AoS is natural for OOP; SoA wins when loops access only a few fields from large arrays
- Cache lines are 64 bytes — pack frequently co-accessed data within 64-byte boundaries
- False sharing occurs when two threads write different variables that share a cache line — use
alignas(64)to isolate hot written data - SoA pitfalls: maintain index consistency across all parallel arrays, and avoid random single-object lookups
- Measure with
perf stat -e cache-missesbefore optimizing — false sharing and AoS/SoA problems look similar in profiles - ECS (Entity Component System) in game engines is the mainstream application of SoA principles at scale
자주 묻는 질문 (FAQ)
Q. 이 내용을 실무에서 언제 쓰나요?
A. Data-oriented design for C++ performance: AoS vs SoA layout, cache lines, false sharing, alignas, benchmarking with perf… 실무에서는 위 본문의 예제와 선택 가이드를 참고해 적용하면 됩니다.
Q. 선행으로 읽으면 좋은 글은?
A. 각 글 하단의 이전 글 또는 관련 글 링크를 따라가면 순서대로 배울 수 있습니다. C++ 시리즈 목차에서 전체 흐름을 확인할 수 있습니다.
Q. 더 깊이 공부하려면?
A. cppreference와 해당 라이브러리 공식 문서를 참고하세요. 글 말미의 참고 자료 링크도 활용하면 좋습니다.
같이 보면 좋은 글 (내부 링크)
이 주제와 연결되는 다른 글입니다.
- C++ 캐시 최적화 실전 | 캐시 친화적 구조·프리페치·False Sharing·AoS vs SoA 가이드
- C++ Cache Friendly 코드 작성법 | 메모리 접근 패턴으로 성능 10배 향상
- C++ 메모리 정렬 | Alignment·Padding·False Sharing 완벽 정리
이 글에서 다루는 키워드 (관련 검색어)
C++, Performance, Data-Oriented Design, Cache, Cache Line, SoA, AoS 등으로 검색하시면 이 글이 도움이 됩니다.