C++ Branch Prediction | Complete Guide to likely/unlikely Optimization
이 글의 핵심
C++ branch prediction: CPU pipeline, misprediction penalty, [[likely]]/[[unlikely]], branch elimination, sorting effects, and PGO with practical examples.
Introduction
Modern CPUs use pipeline and superscalar architectures to overlap execution of multiple instructions. When branches in if or loops split the address of next instruction to execute based on conditions, we don’t know “which path will execute” until the condition is computed. So CPUs use branch predictor to perform speculative execution on one path based on past patterns. If correct, proceed; if wrong, pipeline flush (misprediction penalty) wastes dozens of cycles.
What You’ll Learn
- Understand branch prediction principles and misprediction penalty
- Provide hints to compiler with
[[likely]],[[unlikely]] - Optimize performance with branch elimination, sorting, and PGO
- Master commonly used branch optimization patterns in production
Table of Contents
- Basic Concepts
- Practical Implementation
- Advanced Usage
- Performance Comparison
- Real-World Cases
- Troubleshooting
- Conclusion
Basic Concepts
CPU Branch Prediction Principles
- Sequential execution assumption: Pipeline normally fetches “next address” sequentially.
- Conditional branch: When destination has two or more branches before condition is determined, predictor chooses one side.
- Misprediction: Discard work already started on wrong path and refill with correct address. This cost repeated millions of times in loops significantly impacts perceived performance.
Predictable vs Unpredictable
// Predictable branch: fast
for (int i = 0; i < n; ++i) {
if (i < n/2) { // Always same pattern
// ...
}
}
// Unpredictable branch: slow
for (int i = 0; i < n; ++i) {
if (data[i] % 2 == 0) { // Random
// ...
}
}
Practical Implementation
1) [[likely]] / [[unlikely]] (C++20)
#include <iostream>
#include <stdexcept>
int divide(int a, int b) {
if (b == 0) [[unlikely]] {
throw std::runtime_error("Division by zero");
}
return a / b;
}
void process(int* data, int n) {
for (int i = 0; i < n; ++i) {
if (data[i] > 0) [[likely]] {
// Mostly positive
processPositive(data[i]);
} else {
processNegative(data[i]);
}
}
}
int main() {
int result = divide(10, 2);
std::cout << result << std::endl;
return 0;
}
2) Branch Elimination (Branchless)
Conditional Move (cmov)
// ❌ Branch
int max(int a, int b) {
if (a > b) {
return a;
} else {
return b;
}
}
// ✅ Conditional move
int max(int a, int b) {
return (a > b) ? a : b;
}
// Compiler generates cmov instruction
Using Masks
#include <vector>
#include <chrono>
#include <iostream>
int main() {
std::vector<int> data(10000000);
for (int i = 0; i < data.size(); ++i) {
data[i] = i % 100 - 50;
}
// ❌ Many branches
auto start1 = std::chrono::high_resolution_clock::now();
int sum1 = 0;
for (int x : data) {
if (x > 0) {
sum1 += x;
}
}
auto end1 = std::chrono::high_resolution_clock::now();
auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
// ✅ Branch elimination
auto start2 = std::chrono::high_resolution_clock::now();
int sum2 = 0;
for (int x : data) {
int mask = (x > 0) ? 1 : 0;
sum2 += x * mask;
}
auto end2 = std::chrono::high_resolution_clock::now();
auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
std::cout << "With branches: " << time1 << "ms" << std::endl;
std::cout << "Branchless: " << time2 << "ms" << std::endl;
return 0;
}
3) Sorting Effect
#include <algorithm>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>
int main() {
std::vector<int> data(10000000);
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> dis(0, 10000);
std::generate(data.begin(), data.end(), [&]() { return dis(gen); });
// ❌ Random data (unpredictable)
auto start1 = std::chrono::high_resolution_clock::now();
int sum1 = 0;
for (int x : data) {
if (x > 5000) {
sum1 += x;
}
}
auto end1 = std::chrono::high_resolution_clock::now();
auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
// ✅ After sorting (predictable)
std::sort(data.begin(), data.end());
auto start2 = std::chrono::high_resolution_clock::now();
int sum2 = 0;
for (int x : data) {
if (x > 5000) {
sum2 += x;
}
}
auto end2 = std::chrono::high_resolution_clock::now();
auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
std::cout << "Random: " << time1 << "ms" << std::endl;
std::cout << "Sorted: " << time2 << "ms" << std::endl;
return 0;
}
Result: 2-3x faster after sorting
Advanced Usage
1) PGO (Profile-Guided Optimization)
Profile Generation
# GCC/Clang
g++ -O3 -fprofile-generate program.cpp -o program
./program # Run representative workload
g++ -O3 -fprofile-use program.cpp -o program_optimized
MSVC
# Generate profile
cl /O2 /GL /LTCG:PGI program.cpp
program.exe # Run representative workload
# Use profile
cl /O2 /GL /LTCG:PGO program.cpp
2) Lookup Tables
// ❌ Many branches
int getDayName(int day) {
if (day == 0) return "Sunday";
if (day == 1) return "Monday";
if (day == 2) return "Tuesday";
// ...
}
// ✅ Lookup table
const char* DAY_NAMES[] = {
"Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday"
};
const char* getDayName(int day) {
return DAY_NAMES[day];
}
3) Virtual Function Optimization
// ❌ Virtual function (indirect branch)
class Shape {
public:
virtual double area() const = 0;
};
std::vector<Shape*> shapes;
double total = 0;
for (auto* shape : shapes) {
total += shape->area(); // Indirect branch
}
// ✅ Separate by type
std::vector<Circle> circles;
std::vector<Rectangle> rectangles;
double total = 0;
for (const auto& circle : circles) {
total += circle.area(); // Direct call
}
for (const auto& rect : rectangles) {
total += rect.area();
}
Performance Comparison
Branch Misprediction Cost
Test: 10 million iterations
| Branch Pattern | Time | Speedup |
|---|---|---|
| Predictable (always true) | 10ms | 10x |
| Predictable (always false) | 10ms | 10x |
| Unpredictable (random) | 100ms | 1x |
| Conclusion: 10x slower on misprediction |
Branch Elimination Effect
Test: 10 million conditional operations
| Method | Time | Speedup |
|---|---|---|
| Branch (random) | 100ms | 1x |
| Branchless (mask) | 50ms | 2x |
| Conclusion: 2x improvement with branch elimination |
Sorting Effect
Test: 10 million random data points
| Method | Time | Speedup |
|---|---|---|
| Random data | 100ms | 1x |
| After sorting | 30ms | 3.3x |
| Conclusion: 3x improvement with sorting |
Real-World Cases
Case 1: Packet Filtering
#include <vector>
#include <chrono>
#include <iostream>
struct Packet {
int type;
int size;
};
void filterPackets(const std::vector<Packet>& packets) {
int count = 0;
for (const auto& packet : packets) {
if (packet.type == 1) [[likely]] {
// Mostly type 1
count++;
}
}
std::cout << "Type 1 packets: " << count << std::endl;
}
int main() {
std::vector<Packet> packets(10000000);
for (auto& p : packets) {
p.type = (rand() % 100 < 90) ? 1 : 2; // 90% type 1
p.size = 1500;
}
auto start = std::chrono::high_resolution_clock::now();
filterPackets(packets);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Time: " << duration << "ms" << std::endl;
return 0;
}
Case 2: Image Processing - Threshold Filter
#include <vector>
#include <algorithm>
#include <chrono>
#include <iostream>
void applyThreshold(std::vector<uint8_t>& image, uint8_t threshold) {
// ❌ Many branches
auto start1 = std::chrono::high_resolution_clock::now();
for (auto& pixel : image) {
if (pixel > threshold) {
pixel = 255;
} else {
pixel = 0;
}
}
auto end1 = std::chrono::high_resolution_clock::now();
auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
std::cout << "With branches: " << time1 << "ms" << std::endl;
}
void applyThresholdBranchless(std::vector<uint8_t>& image, uint8_t threshold) {
// ✅ Branch elimination
auto start2 = std::chrono::high_resolution_clock::now();
for (auto& pixel : image) {
int mask = (pixel > threshold) ? 0xFF : 0x00;
pixel = mask;
}
auto end2 = std::chrono::high_resolution_clock::now();
auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
std::cout << "Branchless: " << time2 << "ms" << std::endl;
}
int main() {
std::vector<uint8_t> image(10000000);
std::generate(image.begin(), image.end(), []() { return rand() % 256; });
applyThreshold(image, 128);
applyThresholdBranchless(image, 128);
return 0;
}
Case 3: Finance - Conditional Fees
#include <vector>
#include <chrono>
#include <iostream>
struct Transaction {
double amount;
int type;
};
double calculateFees(const std::vector<Transaction>& transactions) {
double total_fee = 0.0;
for (const auto& tx : transactions) {
if (tx.type == 1) [[likely]] {
// Regular transaction (90%)
total_fee += tx.amount * 0.001;
} else if (tx.type == 2) {
// Special transaction (9%)
total_fee += tx.amount * 0.002;
} else [[unlikely]] {
// Exception transaction (1%)
total_fee += tx.amount * 0.005;
}
}
return total_fee;
}
int main() {
std::vector<Transaction> transactions(10000000);
for (auto& tx : transactions) {
int r = rand() % 100;
tx.type = (r < 90) ? 1 : (r < 99) ? 2 : 3;
tx.amount = 1000.0;
}
auto start = std::chrono::high_resolution_clock::now();
double fee = calculateFees(transactions);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Fee: " << fee << std::endl;
std::cout << "Time: " << duration << "ms" << std::endl;
return 0;
}
Troubleshooting
Problem 1: Excessive Hints
Symptom: Wrong hints causing performance degradation
// ❌ Wrong hint
if (condition) [[unlikely]] {
// Actually executes frequently (50%)
// Performance degradation
}
// ✅ Apply after profiling
// perf stat -e branch-misses ./program
Problem 2: Branch Elimination Cost
Symptom: Branch elimination actually slower
// Simple branch (predictable)
if (x > 0) {
y = x;
} else {
y = 0;
}
// Branch elimination (not always faster)
y = (x > 0) ? x : 0;
// ✅ Compiler optimizes
// Choose after measurement
Problem 3: Sorting Cost vs Benefit
Symptom: Sorting cost exceeds branch prediction benefit
std::vector<int> data = generateData(1000);
// Single pass: don't sort
for (int x : data) {
if (x > threshold) { /* ....*/ }
}
// Multiple passes: sort
std::sort(data.begin(), data.end());
for (int i = 0; i < 100; ++i) {
for (int x : data) {
if (x > threshold) { /* ....*/ }
}
}
Guideline: Consider sorting if passes > 10
Problem 4: Platform Dependency
Symptom: Fast on one CPU but slow on another
# Measure on target CPU
perf stat -e branch-misses,branches ./program
# Output:
# 1,234,567 branch-misses
# 10,000,000 branches
# 12.3% miss rate
Solution: Benchmark on target CPU class
Conclusion
C++ branch prediction is a core element of performance optimization.
Key Summary
- Branch Prediction
- CPU speculatively executes branch result
- Dozens of cycles lost on misprediction
- [[likely]] / [[unlikely]]
- C++20 standard attributes
- Hints to compiler
- Branch Elimination
- Conditional move (cmov)
- Using masks
- Sorting Effect
- Predictable patterns
- 3x performance improvement
- PGO
- Based on actual workload
- Automatic optimization
Optimization Techniques
| Technique | Effect | Difficulty |
|---|---|---|
[[likely]]/[[unlikely]] | 1.2-1.5x | Low |
| Branch elimination | 2x | Medium |
| Sorting | 3x | Low |
| PGO | 1.5-2x | Medium |
Code Example Cheatsheet
// likely/unlikely
if (rare) [[unlikely]] { /* ....*/ }
// Branch elimination
result = (condition) ? a : b;
// Mask
int mask = (x > 0) ? 1 : 0;
sum += x * mask;
// Sorting
std::sort(data.begin(), data.end());
// Lookup table
result = table[index];
Next Steps
- Cache Optimization: C++ Cache Optimization
- Performance Optimization: C++ Performance Optimization
- Profiling: C++ Profiling
References
- “Computer Architecture: A Quantitative Approach” - Hennessy, Patterson
- “Optimized C++” - Kurt Guntheroth
- “Agner Fog’s Optimization Manuals” One-line summary: Branch prediction is fast with predictable patterns, and performance can be improved 2-3x with [[likely]]/[[unlikely]], branch elimination, sorting, and PGO.