분기 예측을 위해 if 순서만 바꾸면 성능이 오르나요?

실제 분포(희귀 경로 빈도)에 따라 달라집니다. 추측보다 프로파일러로 핫 스팟과 브랜치 미스율을 먼저 확인하고, 의미를 바꾸지 않는 범위에서만 재배열하세요.

likely/unlikely 같은 힌트는 표준인가요?

C++20 표준 속성입니다. 컴파일러에 힌트를 주지만, 실제 예측 정확도는 워크로드와 마이크로아키텍처에 의존합니다.

예외 throw가 분기 예측에 나쁜가요?

예외는 희귀 경로로 두는 것이 일반적이라 정상 경로 예측에는 유리할 수 있지만, 실제 비용은 RTTI·테이블·코드 크기까지 포함됩니다. 핫 루프에서는 오류 코드나 `expected`를 비교해 보세요.

스위치문이 if 체인보다 항상 빠른가요?

케이스가 밀집된 정수면 점프 테이블에 유리할 수 있지만 보장은 아닙니다. 컴파일러 최적화와 데이터 분포에 따라 달라져 측정이 필요합니다.

C++ Branch Prediction | 분기 예측·likely·unlikely 완벽 정리

2026년 3월 12일 · 32분 읽기 · 수정 2026년 3월 31일 고급 튜토리얼

이 글의 핵심

C++ 분기 예측: CPU 파이프라인, misprediction penalty, [[likely]]/[[unlikely]], 분기 제거, 정렬 효과, PGO를 실전 예제와 함께 정리합니다.

들어가며

현대 CPU는 파이프라인과 슈퍼스칼라 구조로 여러 명령을 겹쳐 실행합니다. if나 루프 분기에서 다음에 실행할 명령의 주소가 조건에 따라 갈라지면, 조건이 계산되기 전까지는 “어느 쪽이 실행될지”를 모릅니다. 그래서 CPU는 분기 예측기(branch predictor)가 과거 패턴을 바탕으로 한쪽 경로를 추측 실행(speculative execution)합니다. 맞으면 그대로 진행하고, 틀리면 파이프라인 플러시(misprediction penalty)로 수십 사이클을 날립니다.

이 글을 읽으면

분기 예측의 원리와 misprediction penalty를 이해합니다
[[likely]], [[unlikely]]로 컴파일러에 힌트를 줍니다
분기 제거, 정렬, PGO로 성능을 최적화합니다
실무에서 자주 쓰이는 분기 최적화 패턴을 익힙니다

기본 개념

CPU 분기 예측 원리

순차 실행 가정: 파이프라인은 보통 “다음 주소”를 연속으로 가져옵니다.
조건부 분기: 조건이 확정되기 전에 목적지가 두 갈래 이상이면, 예측기가 한쪽을 선택합니다.
미스 예측: 잘못 고른 경로에서 이미 시작한 작업을 버리고 올바른 주소로 다시 채웁니다. 이 비용이 루프 내부에서 수백만 번 반복되면 체감 성능에 큰 영향을 줍니다.

예측 가능 vs 예측 불가능

// 예측 가능한 분기: 빠름
for (int i = 0; i < n; ++i) {
    if (i < n/2) {  // 항상 같은 패턴
        // ...
    }
}

// 예측 불가능한 분기: 느림
for (int i = 0; i < n; ++i) {
    if (data[i] % 2 == 0) {  // 랜덤
        // ...
    }
}

실전 구현

1) [[likely]] / [[unlikely]] (C++20)

#include <iostream>
#include <stdexcept>

int divide(int a, int b) {
    if (b == 0) [[unlikely]] {
        throw std::runtime_error("0으로 나눔");
    }
    
    return a / b;
}

void process(int* data, int n) {
    for (int i = 0; i < n; ++i) {
        if (data[i] > 0) [[likely]] {
            // 대부분 양수
            processPositive(data[i]);
        } else {
            processNegative(data[i]);
        }
    }
}

int main() {
    int result = divide(10, 2);
    std::cout << result << std::endl;
    
    return 0;
}

2) 분기 제거 (Branchless)

조건부 이동 (cmov)

// ❌ 분기
int max(int a, int b) {
    if (a > b) {
        return a;
    } else {
        return b;
    }
}

// ✅ 조건부 이동
int max(int a, int b) {
    return (a > b) ? a : b;
}

// 컴파일러가 cmov 명령 생성

마스크 사용

#include <vector>
#include <chrono>
#include <iostream>

int main() {
    std::vector<int> data(10000000);
    for (int i = 0; i < data.size(); ++i) {
        data[i] = i % 100 - 50;
    }
    
    // ❌ 분기 많음
    auto start1 = std::chrono::high_resolution_clock::now();
    int sum1 = 0;
    for (int x : data) {
        if (x > 0) {
            sum1 += x;
        }
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    
    // ✅ 분기 제거
    auto start2 = std::chrono::high_resolution_clock::now();
    int sum2 = 0;
    for (int x : data) {
        int mask = (x > 0) ? 1 : 0;
        sum2 += x * mask;
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    
    std::cout << "분기: " << time1 << "ms" << std::endl;
    std::cout << "분기 제거: " << time2 << "ms" << std::endl;
    
    return 0;
}

3) 정렬 효과

#include <algorithm>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>

int main() {
    std::vector<int> data(10000000);
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, 10000);
    std::generate(data.begin(), data.end(), [&]() { return dis(gen); });
    
    // ❌ 랜덤 데이터 (예측 불가)
    auto start1 = std::chrono::high_resolution_clock::now();
    int sum1 = 0;
    for (int x : data) {
        if (x > 5000) {
            sum1 += x;
        }
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    
    // ✅ 정렬 후 (예측 가능)
    std::sort(data.begin(), data.end());
    
    auto start2 = std::chrono::high_resolution_clock::now();
    int sum2 = 0;
    for (int x : data) {
        if (x > 5000) {
            sum2 += x;
        }
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    
    std::cout << "랜덤: " << time1 << "ms" << std::endl;
    std::cout << "정렬: " << time2 << "ms" << std::endl;
    
    return 0;
}

결과: 정렬 후 2-3배 빠름

고급 활용

1) PGO (Profile-Guided Optimization)

프로파일 생성

# GCC/Clang
g++ -O3 -fprofile-generate program.cpp -o program
./program  # 대표 워크로드 실행
g++ -O3 -fprofile-use program.cpp -o program_optimized

MSVC

# 프로파일 생성
cl /O2 /GL /LTCG:PGI program.cpp
program.exe  # 대표 워크로드 실행

# 프로파일 사용
cl /O2 /GL /LTCG:PGO program.cpp

2) 룩업 테이블

// ❌ 분기 많음
int getDayName(int day) {
    if (day == 0) return "Sunday";
    if (day == 1) return "Monday";
    if (day == 2) return "Tuesday";
    // ...
}

// ✅ 룩업 테이블
const char* DAY_NAMES[] = {
    "Sunday", "Monday", "Tuesday", "Wednesday",
    "Thursday", "Friday", "Saturday"
};

const char* getDayName(int day) {
    return DAY_NAMES[day];
}

3) 가상 함수 최적화

// ❌ 가상 함수 (간접 분기)
class Shape {
public:
    virtual double area() const = 0;
};

std::vector<Shape*> shapes;
double total = 0;
for (auto* shape : shapes) {
    total += shape->area();  // 간접 분기
}

// ✅ 타입별 분리
std::vector<Circle> circles;
std::vector<Rectangle> rectangles;

double total = 0;
for (const auto& circle : circles) {
    total += circle.area();  // 직접 호출
}
for (const auto& rect : rectangles) {
    total += rect.area();
}

성능 비교

분기 예측 실패 비용

테스트: 1천만 번 반복

분기 패턴	시간	배속
예측 가능 (항상 true)	10ms	10x
예측 가능 (항상 false)	10ms	10x
예측 불가능 (랜덤)	100ms	1x

결론: 예측 실패 시 10배 느림

분기 제거 효과

테스트: 1천만 번 조건 처리

방법	시간	배속
분기 (랜덤)	100ms	1x
분기 제거 (마스크)	50ms	2x

결론: 분기 제거로 2배 개선

정렬 효과

테스트: 1천만 개 랜덤 데이터

방법	시간	배속
랜덤 데이터	100ms	1x
정렬 후	30ms	3.3x

결론: 정렬로 3배 개선

실무 사례

사례 1: 패킷 필터링

#include <vector>
#include <chrono>
#include <iostream>

struct Packet {
    int type;
    int size;
};

void filterPackets(const std::vector<Packet>& packets) {
    int count = 0;
    
    for (const auto& packet : packets) {
        if (packet.type == 1) [[likely]] {
            // 대부분 타입 1
            count++;
        }
    }
    
    std::cout << "타입 1 패킷: " << count << std::endl;
}

int main() {
    std::vector<Packet> packets(10000000);
    for (auto& p : packets) {
        p.type = (rand() % 100 < 90) ? 1 : 2;  // 90% 타입 1
        p.size = 1500;
    }
    
    auto start = std::chrono::high_resolution_clock::now();
    filterPackets(packets);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    
    std::cout << "시간: " << duration << "ms" << std::endl;
    
    return 0;
}

사례 2: 이미지 처리 - 임계값 필터

#include <vector>
#include <algorithm>
#include <chrono>
#include <iostream>

void applyThreshold(std::vector<uint8_t>& image, uint8_t threshold) {
    // ❌ 분기 많음
    auto start1 = std::chrono::high_resolution_clock::now();
    for (auto& pixel : image) {
        if (pixel > threshold) {
            pixel = 255;
        } else {
            pixel = 0;
        }
    }
    auto end1 = std::chrono::high_resolution_clock::now();
    auto time1 = std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start1).count();
    
    std::cout << "분기: " << time1 << "ms" << std::endl;
}

void applyThresholdBranchless(std::vector<uint8_t>& image, uint8_t threshold) {
    // ✅ 분기 제거
    auto start2 = std::chrono::high_resolution_clock::now();
    for (auto& pixel : image) {
        int mask = (pixel > threshold) ? 0xFF : 0x00;
        pixel = mask;
    }
    auto end2 = std::chrono::high_resolution_clock::now();
    auto time2 = std::chrono::duration_cast<std::chrono::milliseconds>(end2 - start2).count();
    
    std::cout << "분기 제거: " << time2 << "ms" << std::endl;
}

int main() {
    std::vector<uint8_t> image(10000000);
    std::generate(image.begin(), image.end(), []() { return rand() % 256; });
    
    applyThreshold(image, 128);
    applyThresholdBranchless(image, 128);
    
    return 0;
}

사례 3: 금융 - 조건부 수수료

#include <vector>
#include <chrono>
#include <iostream>

struct Transaction {
    double amount;
    int type;
};

double calculateFees(const std::vector<Transaction>& transactions) {
    double total_fee = 0.0;
    
    for (const auto& tx : transactions) {
        if (tx.type == 1) [[likely]] {
            // 일반 거래 (90%)
            total_fee += tx.amount * 0.001;
        } else if (tx.type == 2) {
            // 특수 거래 (9%)
            total_fee += tx.amount * 0.002;
        } else [[unlikely]] {
            // 예외 거래 (1%)
            total_fee += tx.amount * 0.005;
        }
    }
    
    return total_fee;
}

int main() {
    std::vector<Transaction> transactions(10000000);
    for (auto& tx : transactions) {
        int r = rand() % 100;
        tx.type = (r < 90) ? 1 : (r < 99) ? 2 : 3;
        tx.amount = 1000.0;
    }
    
    auto start = std::chrono::high_resolution_clock::now();
    double fee = calculateFees(transactions);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    
    std::cout << "수수료: " << fee << std::endl;
    std::cout << "시간: " << duration << "ms" << std::endl;
    
    return 0;
}

트러블슈팅

문제 1: 과도한 힌트

증상: 잘못된 힌트로 성능 저하

// ❌ 잘못된 힌트
if (condition) [[unlikely]] {
    // 실제로는 자주 실행 (50%)
    // 성능 저하
}

// ✅ 프로파일링 후 적용
// perf stat -e branch-misses ./program

문제 2: 분기 제거 비용

증상: 분기 제거가 오히려 느림

// 간단한 분기 (예측 가능)
if (x > 0) {
    y = x;
} else {
    y = 0;
}

// 분기 제거 (항상 빠른 것은 아님)
y = (x > 0) ? x : 0;

// ✅ 컴파일러가 최적화
// 측정 후 선택

문제 3: 정렬 비용 vs 이득

증상: 정렬 비용이 분기 예측 이득보다 큼

std::vector<int> data = generateData(1000);

// 한 번만 순회: 정렬 안 함
for (int x : data) {
    if (x > threshold) { /* ... */ }
}

// 여러 번 순회: 정렬
std::sort(data.begin(), data.end());
for (int i = 0; i < 100; ++i) {
    for (int x : data) {
        if (x > threshold) { /* ... */ }
    }
}

기준: 순회 횟수 > 10회면 정렬 고려

문제 4: 플랫폼 의존성

증상: 한 CPU에서는 빠르지만 다른 CPU에서는 느림

# 타깃 CPU에서 측정
perf stat -e branch-misses,branches ./program

# 출력:
# 1,234,567 branch-misses
# 10,000,000 branches
# 12.3% miss rate

해결: 타깃 CPU 클래스에서 벤치마크

마무리

C++ 분기 예측은 성능 최적화의 핵심 요소입니다.

핵심 요약

분기 예측
- CPU가 분기 결과를 추측 실행
- 미스 예측 시 수십 사이클 손실
[[likely]] / [[unlikely]]
- C++20 표준 속성
- 컴파일러에 힌트
분기 제거
- 조건부 이동 (cmov)
- 마스크 사용
정렬 효과
- 예측 가능한 패턴
- 3배 성능 개선
PGO
- 실제 워크로드 기반
- 자동 최적화

최적화 기법

기법	효과	난이도
`[[likely]]/[[unlikely]]`	1.2-1.5배	낮음
분기 제거	2배	중간
정렬	3배	낮음
PGO	1.5-2배	중간

코드 예제 치트시트

// likely/unlikely
if (rare) [[unlikely]] { /* ... */ }

// 분기 제거
result = (condition) ? a : b;

// 마스크
int mask = (x > 0) ? 1 : 0;
sum += x * mask;

// 정렬
std::sort(data.begin(), data.end());

// 룩업 테이블
result = table[index];

다음 단계

캐시 최적화: C++ Cache Optimization
성능 최적화: C++ 성능 최적화
프로파일링: C++ Profiling

참고 자료

“Computer Architecture: A Quantitative Approach” - Hennessy, Patterson
“Optimized C++” - Kurt Guntheroth
“Agner Fog’s Optimization Manuals”

한 줄 정리: 분기 예측은 예측 가능한 패턴에서 빠르며, [[likely]]/[[unlikely]], 분기 제거, 정렬, PGO로 성능을 2-3배 개선할 수 있다.