이 내용을 실무에서 언제 쓰나요?

대량 데이터 정렬·변환·집계, 이미지 처리, 수치 시뮬레이션, ETL 파이프라인 등에서 프로파일러로 루프 병목이 확인되면 std::execution::par/par_unseq를 적용하면 됩니다.

선행으로 읽으면 좋은 글은?

SIMD·std::execution #39-3, 스레드 풀 #51-3, STL 알고리즘 선택 #54-1을 먼저 읽으면 이해가 빠릅니다.

더 깊이 공부하려면?

cppreference 병렬 알고리즘, Intel TBB, Microsoft PPL 문서를 참고하세요.

C++ 병렬 알고리즘 완벽 가이드 | std::execution::par·par_unseq

2026년 4월 5일 · 45분 읽기 · 수정 2026년 4월 5일 고급 실습

이 글의 핵심

C++17 병렬 알고리즘: std::execution::par, par_unseq, std::sort·transform·reduce 병렬화. 문제 시나리오, 완전한 예제, 흔한 에러, 베스트 프랙티스, 프로덕션 패턴까지 실전 코드로 다룹니다.

들어가며: 멀티코어가 있는데 한 코어만 쓰고 있어요

”100만 개 정렬하는데 8코어 중 1개만 100% 사용해요”

#39-3 SIMD와 std::execution에서 std::execution::par 기초를 다뤘다면, 이 글은 C++17 병렬 알고리즘을 집중적으로 다룹니다. std::sort, std::transform, std::reduce 등 기존 STL 알고리즘에 실행 정책만 추가하면 멀티코어를 활용할 수 있습니다. 스레드 풀을 직접 만들 필요 없이, 표준 라이브러리가 알아서 병렬화합니다.

비유: 병렬 알고리즘은 “일을 여러 사람에게 나눠 주는 것”입니다. 8명이 있는데 1명만 일하면 7명은 놀고 있습니다. std::execution::par를 쓰면 8명이 동시에 일합니다.

이 글을 읽으면:

std::execution::par와 par_unseq의 차이를 이해할 수 있습니다.
병렬 sort, transform, reduce를 완전한 예제로 구현할 수 있습니다.
자주 발생하는 에러(데이터 레이스, par_unseq 제약)를 피할 수 있습니다.
프로덕션에서 검증된 패턴을 적용할 수 있습니다.

요구 환경: C++17 이상, <execution> 헤더 (MSVC/GCC/Clang 지원)

1. 문제 시나리오

시나리오 1: 대용량 배열 정렬이 병목일 때

"1000만 개 int를 정렬하는데 3초가 걸려요."
"프로파일러에서 std::sort가 80%를 차지해요."
"8코어인데 한 코어만 100% 사용해요."

상황: std::sort(v.begin(), v.end())는 기본적으로 순차 실행입니다. 데이터가 크면 한 스레드만 풀가동하고 나머지 코어는 유휴 상태입니다.

해결 포인트: std::sort(std::execution::par, v.begin(), v.end())로 바꾸면 내부적으로 여러 스레드가 구간을 나눠 정렬합니다. 1000만 개 이상에서는 4~8배 가속이 흔합니다.

시나리오 2: 이미지 픽셀 변환이 느릴 때

"1920×1080 이미지에 필터를 적용하는데 50ms가 걸려요."
"60fps 목표인데 이 루프 하나 때문에 30fps밖에 안 나와요."

상황: 픽셀마다 out[i] = gamma_correct(in[i]) 같은 변환을 적용합니다. 200만 픽셀 × 순차 루프 = 한 코어만 사용합니다.

해결 포인트: std::transform(std::execution::par, in.begin(), in.end(), out.begin(), gamma_correct)로 병렬화하면 코어 수만큼 구간을 나눠 처리합니다.

시나리오 3: 대량 데이터 집계가 병목일 때

"1억 개 double의 합을 구하는데 500ms가 걸려요."
"std::accumulate는 순차라서 한 코어만 쓰고 있어요."

상황: std::accumulate는 순서가 보장되지만 병렬화할 수 없습니다. 합계·곱·최대값처럼 결합 법칙이 성립하면 순서를 바꿔도 결과가 같습니다.

해결 포인트: std::reduce(std::execution::par, v.begin(), v.end(), 0.0)로 바꾸면 부분 합을 병렬로 구한 뒤 합칩니다. 부동소수점은 accumulate와 미세하게 다른 결과가 나올 수 있으나, 대량 데이터에서는 허용되는 경우가 많습니다.

시나리오 4: ETL 파이프라인에서 변환 단계가 느릴 때

"DB에서 100만 건 읽어서 변환·필터링하는데 10초 걸려요."
"변환 로직 자체는 단순한데, 순차 처리라서 코어를 못 쓰고 있어요."

상황: std::transform + std::copy_if 조합을 순차로 실행하면 I/O 대기 후 CPU도 한 코어만 사용합니다.

해결 포인트: std::transform(std::execution::par, ...)로 변환 단계를 병렬화하고, 가능하면 std::transform_reduce로 변환·집계를 한 번에 처리합니다.

시나리오 5: 스레드 풀 없이 간단히 병렬화하고 싶을 때

"std::async를 루프마다 쓰면 future가 너무 많이 생겨요."
"스레드 풀 구현은 복잡한데, 간단한 병렬화만 필요해요."

상황: std::async를 루프 안에서 호출하면 작업 수만큼 future와 스레드가 생성됩니다. 스레드 풀을 직접 구현하려면 작업 큐·워커·종료 처리 등이 필요합니다.

해결 포인트: std::for_each(std::execution::par, ...) 또는 std::transform(std::execution::par, ...)를 쓰면 라이브러리가 내부적으로 스레드 풀을 관리합니다. 코드 한 줄 추가로 병렬화됩니다.

시나리오 6: par_unseq로 SIMD까지 활용하고 싶을 때

"par로 4배 빨라졌는데, SIMD까지 쓰면 더 빨라질까요?"
"람다가 순수 함수라서 벡터화해도 될 것 같아요."

상황: std::execution::par는 멀티스레드만 적용합니다. par_unseq는 병렬 + 벡터화(SIMD)를 허용해, 한 스레드 내에서도 4~8개 원소를 한 번에 처리할 수 있습니다.

해결 포인트: 람다가 동기화 프리(락, atomic, 공유 변수 수정 없음)일 때만 par_unseq를 사용합니다. 위반 시 정의되지 않은 동작입니다.

2. std::execution 정책 완전 가이드

정책 비교

flowchart TB
    subgraph seq["seq (순차)"]
        S1[원소 1] --> S2[원소 2] --> S3[원소 3] --> S4[...]
    end
    subgraph par["par (병렬)"]
        P1[스레드 1: 구간 A]
        P2[스레드 2: 구간 B]
        P3[스레드 3: 구간 C]
        P4[스레드 4: 구간 D]
    end
    subgraph par_unseq["par_unseq (병렬+SIMD)"]
        U1["스레드 1: SIMD로 8개씩"]
        U2["스레드 2: SIMD로 8개씩"]
    end

정책	설명	멀티스레드	SIMD	요구사항
seq	순차 실행 (기본)	❌	❌	없음
par	멀티스레드 병렬	✅	❌	반복자·함수 스레드 안전
par_unseq	병렬 + 벡터화	✅	✅	동기화 프리 (락·atomic 금지)
unseq (C++20)	단일 스레드 벡터화만	❌	✅	동기화 프리

seq vs par

// seq: 순차 실행 (기본값과 동일)
#include <algorithm>
#include <execution>
#include <vector>

void sort_sequential(std::vector<int>& v) {
    std::sort(std::execution::seq, v.begin(), v.end());
}

// par: 멀티스레드 병렬
void sort_parallel(std::vector<int>& v) {
    std::sort(std::execution::par, v.begin(), v.end());
}

par vs par_unseq

// par: 락 사용 가능 (스레드 안전만 지키면 됨)
std::mutex mtx;
std::for_each(std::execution::par, v.begin(), v.end(), [&mtx](int x) {
    std::lock_guard<std::mutex> lock(mtx);
    shared_result += process(x);
});

// par_unseq: 락·atomic·공유 변수 수정 금지 — 원소별 완전 독립만
std::transform(std::execution::par_unseq, a.begin(), a.end(), b.begin(),
               c.begin(),  { return x * y + 1.0; });

par_unseq 위반 예:

// ❌ UB: par_unseq에서 락 사용
std::mutex m;
std::for_each(std::execution::par_unseq, v.begin(), v.end(), [&m](int x) {
    std::lock_guard<std::mutex> lock(m);  // 정의되지 않은 동작!
    counter++;
});

3. 병렬 sort·transform·reduce 완전 예제

예제 1: std::execution::par — 병렬 정렬

#include <algorithm>
#include <execution>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>

int main() {
    std::vector<int> v(10'000'000);
    std::mt19937 gen(42);
    std::uniform_int_distribution<> dis(0, 1'000'000);
    for (auto& x : v) x = dis(gen);

    auto start = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, v.begin(), v.end());
    auto end = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "Parallel sort: " << ms << " ms\n";

    return 0;
}

컴파일 (Windows MSVC):

cl /EHsc /std:c++17 /O2 /MD parallel_sort.cpp

컴파일 (Linux GCC):

g++ -std=c++17 -O3 -pthread -o parallel_sort parallel_sort.cpp

예제 2: std::execution::par_unseq — 병렬 + SIMD 변환

#include <algorithm>
#include <execution>
#include <vector>
#include <cmath>

// 감마 보정: out = pow(in, gamma)
void gamma_correct_parallel(const std::vector<float>& in,
                            std::vector<float>& out,
                            float gamma) {
    out.resize(in.size());
    std::transform(std::execution::par_unseq,
                   in.begin(), in.end(),
                   out.begin(),
                   [gamma](float x) { return std::pow(x, gamma); });
}

// 벡터 덧셈 (완전 독립 → par_unseq 적합)
void add_vectors_par_unseq(const std::vector<double>& a,
                            const std::vector<double>& b,
                            std::vector<double>& c) {
    c.resize(a.size());
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), c.begin(),
                    { return x + y; });
}

예제 3: 병렬 reduce — 합계·내적·최대값

#include <numeric>
#include <execution>
#include <vector>
#include <algorithm>

// 병렬 합계
double sum_parallel(const std::vector<double>& v) {
    return std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
}

// 병렬 내적 (dot product)
double dot_product_parallel(const std::vector<double>& a,
                            const std::vector<double>& b) {
    return std::transform_reduce(
        std::execution::par,
        a.begin(), a.end(), b.begin(), 0.0,
        std::plus<>(), std::multiplies<>());
}

// 병렬 최대값
int max_parallel(const std::vector<int>& v) {
    return std::reduce(std::execution::par, v.begin(), v.end(),
                       std::numeric_limits<int>::min(),
                        { return std::max(a, b); });
}

예제 4: 병렬 for_each — 인덱스 기반 처리

#include <algorithm>
#include <execution>
#include <vector>
#include <numeric>

void process_chunks_parallel(std::vector<int>& data) {
    std::vector<size_t> indices(data.size());
    std::iota(indices.begin(), indices.end(), 0);

    std::for_each(std::execution::par, indices.begin(), indices.end(),
                  [&data](size_t i) {
                      data[i] = data[i] * 2 + 1;  // 독립 연산
                  });
}

예제 5: 병렬 count_if — 조건 만족 개수

#include <algorithm>
#include <execution>
#include <vector>

size_t count_positive_parallel(const std::vector<double>& v) {
    return std::count_if(std::execution::par,
                         v.begin(), v.end(),
                          { return x > 0; });
}

예제 6: 병렬 파이프라인 — transform + reduce

#include <numeric>
#include <execution>
#include <vector>
#include <cmath>

// 각 원소 제곱 후 합계 (병렬)
double sum_of_squares_parallel(const std::vector<double>& v) {
    return std::transform_reduce(
        std::execution::par,
        v.begin(), v.end(), 0.0,
        std::plus<>(),
         { return x * x; });
}

// L2 노름: sqrt(sum(x^2))
double l2_norm_parallel(const std::vector<double>& v) {
    double sum_sq = sum_of_squares_parallel(v);
    return std::sqrt(sum_sq);
}

지원되는 병렬 알고리즘 목록

알고리즘	par 지원	par_unseq 지원
std::sort	✅	❌
std::transform	✅	✅
std::reduce	✅	✅
std::transform_reduce	✅	✅
std::for_each	✅	✅
std::count_if	✅	✅
std::find_if	✅	✅
std::copy_if	✅	✅
std::fill	✅	✅
std::generate	✅	✅

예제 7: 병렬 partial_sort — 상위 K개만 정렬

#include <algorithm>
#include <execution>
#include <vector>

// 상위 100개만 정렬 (전체 정렬보다 빠름)
void top_k_parallel(std::vector<int>& v, size_t k) {
    std::partial_sort(std::execution::par,
                      v.begin(), v.begin() + k, v.end());
}

예제 8: 병렬 inclusive_scan — 누적 합 (C++17)

#include <numeric>
#include <execution>
#include <vector>

void prefix_sum_parallel(std::vector<int>& v) {
    std::inclusive_scan(std::execution::par, v.begin(), v.end(), v.begin());
}

예제 9: 병렬 all_of / any_of / none_of

#include <algorithm>
#include <execution>
#include <vector>

bool all_positive_parallel(const std::vector<double>& v) {
    return std::all_of(std::execution::par,
                       v.begin(), v.end(),
                        { return x > 0; });
}

bool any_negative_parallel(const std::vector<double>& v) {
    return std::any_of(std::execution::par,
                       v.begin(), v.end(),
                        { return x < 0; });
}

예제 10: 병렬 fill_n — 대량 초기화

#include <algorithm>
#include <execution>
#include <vector>

void init_parallel(std::vector<int>& v, int value) {
    std::fill(std::execution::par, v.begin(), v.end(), value);
}

4. 자주 발생하는 에러와 해결법

에러 진단 플로우

flowchart TD
    A[에러 발생] --> B{데이터 레이스?}
    B -->|Yes| C[reduce/transform_reduce 사용]
    B -->|No| D{par_unseq 사용?}
    D -->|Yes| E{락/atomic/공유변수?}
    E -->|Yes| F[par로 변경]
    E -->|No| G[유지]
    D -->|No| H{반복자 무효화?}
    H -->|Yes| I[별도 출력 버퍼 사용]
    H -->|No| J[캡처·예외 검토]

에러 1: 데이터 레이스 — 공유 변수 수정

증상: 간헐적 크래시, 잘못된 결과, ThreadSanitizer 경고.

// ❌ 잘못된 예: 공유 변수에 병렬로 쓰기
int sum = 0;
std::for_each(std::execution::par, v.begin(), v.end(), [&sum](int x) {
    sum += x;  // 데이터 레이스!
});

해결법:

// ✅ reduce 사용 (원소별 독립, 부분 합 후 병합)
int sum = std::reduce(std::execution::par, v.begin(), v.end(), 0);

// 또는 atomic (성능 저하, 꼭 필요할 때만)
std::atomic<int> sum{0};
std::for_each(std::execution::par, v.begin(), v.end(), [&sum](int x) {
    sum.fetch_add(x);  // 동기화 오버헤드 큼
});

에러 2: par_unseq에서 락 사용

증상: 데드락, 정의되지 않은 동작, 간헐적 크래시.

// ❌ par_unseq에서 mutex 사용 — UB
std::mutex mtx;
std::for_each(std::execution::par_unseq, v.begin(), v.end(), [&mtx](int x) {
    std::lock_guard<std::mutex> lock(mtx);
    process(x);
});

해결법:

// ✅ par만 사용 (락 허용)
std::for_each(std::execution::par, v.begin(), v.end(), [&mtx](int x) {
    std::lock_guard<std::mutex> lock(mtx);
    process(x);
});

// 또는 락 없이 각 스레드별 로컬 결과 수집 후 병합
std::vector<int> results(v.size());
std::transform(std::execution::par_unseq, v.begin(), v.end(), results.begin(),
                { return process(x); });

에러 3: 반복자 무효화

증상: 크래시, 잘못된 결과.

// ❌ 병렬 처리 중 컨테이너 수정
std::for_each(std::execution::par, v.begin(), v.end(), [&v](int x) {
    if (x > 0) v.push_back(x);  // 반복자 무효화!
});

해결법:

// ✅ 출력을 별도 컨테이너에
std::vector<int> result;
result.reserve(v.size());
std::mutex mtx;
std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
    if (x > 0) {
        std::lock_guard<std::mutex> lock(mtx);
        result.push_back(x);
    }
});

// 또는 copy_if + par
std::vector<int> result(v.size());
auto end = std::copy_if(std::execution::par, v.begin(), v.end(), result.begin(),
                         { return x > 0; });
result.erase(end, result.end());

에러 4: 람다 캡처로 use-after-free

증상: 크래시, 쓰레기 값.

// ❌ 참조 캡처 — 스코프 벗어나면 무효
void process_async(const std::vector<int>& data) {
    std::for_each(std::execution::par, data.begin(), data.end(),
                  [&data](int x) {
                      use(data, x);  // data는 process_async 반환 후 무효 가능
                  });
}

해결법:

// ✅ 값 캡처 또는 반복자 범위만 캡처
void process_async(std::vector<int> data) {  // 복사 또는 move
    std::for_each(std::execution::par, data.begin(), data.end(),
                  [&data](int x) { use(data, x); });
}

에러 5: 부동소수점 reduce vs accumulate 차이

증상: std::reduce 결과가 std::accumulate와 미세하게 다름.

// accumulate: 순서 보장 (a + b + c + d)
// reduce: 부분 합 병합 ((a+b) + (c+d)) — 결합 순서 비결정
std::vector<float> v(1000000, 0.1f);
float acc = std::accumulate(v.begin(), v.end(), 0.0f);
float red = std::reduce(std::execution::par, v.begin(), v.end(), 0.0f);
// acc != red (부동소수점 누적 오차로 인해)

해결법: 순서가 중요하면 accumulate(순차). 대량 데이터에서 성능이 중요하고 미세 오차가 허용되면 reduce(병렬). Kahan summation이 필요하면 별도 구현.

에러 6: 빈 범위 또는 단일 원소

증상: 일부 구현에서 예외 또는 비정상 동작.

// 빈 벡터
std::vector<int> empty;
std::sort(std::execution::par, empty.begin(), empty.end());  // OK (no-op)

// 단일 원소 — 병렬화 이득 없지만 안전
std::vector<int> single = {42};
std::sort(std::execution::par, single.begin(), single.end());  // OK

해결법: 표준은 빈 범위를 허용합니다. 구현체에 따라 작은 크기에서는 순차로 폴백할 수 있으므로, 임계값(예: 1000 이상) 이상에서만 par를 쓰는 선택적 적용도 가능합니다.

에러 7: MSVC에서 execution 헤더 링크 오류

증상: LNK2019: unresolved external symbol (parallel algorithms).

해결법: MSVC는 병렬 알고리즘에 Intel TBB 또는 Microsoft PPL을 사용합니다. vcpkg로 TBB 설치:

vcpkg install tbb:x64-windows

그리고 프로젝트에 링크:

# CMake
find_package(TBB REQUIRED)
target_link_libraries(myapp TBB::tbb)

에러 8: 정렬·검색 기준 불일치

증상: std::lower_bound 등으로 찾은 결과가 기대와 다름.

// ❌ 정렬은 name, 검색은 id — UB
std::sort(std::execution::par, users.begin(), users.end(),
          { return a.name < b.name; });
auto it = std::lower_bound(users.begin(), users.end(), target_id,
     { return u.id < id; });

해결법: 정렬 기준과 검색 기준이 동일해야 합니다.

// ✅ id로 정렬 후 id로 검색
std::sort(std::execution::par, users.begin(), users.end(),
          { return a.id < b.id; });
auto it = std::lower_bound(users.begin(), users.end(), target_id,
     { return u.id < id; });

에러 9: 커스텀 비교자에서 스레드 안전 위반

증상: 병렬 정렬 시 간헐적 크래시, 잘못된 결과.

// ❌ 비교자가 캡처한 상태를 수정함
int counter = 0;
std::sort(std::execution::par, v.begin(), v.end(),
    [&counter](int a, int b) {
        counter++;  // 데이터 레이스!
        return a < b;
    });

해결법: 비교자·함수 객체는 상태 불변이어야 합니다. 읽기 전용 캡처만 사용합니다.

// ✅ 순수 함수
std::sort(std::execution::par, v.begin(), v.end(),
     { return a < b; });

에러 10: 반복자 종류 제한

증상: std::sort(std::execution::par, list.begin(), list.end()) — 컴파일 에러.

원인: std::sort는 RandomAccessIterator가 필요합니다. std::list는 BidirectionalIterator만 제공합니다.

해결법: std::list는 list.sort() 멤버 함수를 사용합니다. 병렬 정렬이 필요하면 std::vector로 복사 후 정렬 후 다시 복사하거나, 다른 자료구조를 사용합니다.

// std::list → vector로 복사 후 병렬 정렬
std::list<int> lst = {3, 1, 4, 1, 5};
std::vector<int> vec(lst.begin(), lst.end());
std::sort(std::execution::par, vec.begin(), vec.end());
lst.assign(vec.begin(), vec.end());

5. 베스트 프랙티스

1. 작은 데이터는 seq 유지

// 100개 미만: 병렬 오버헤드가 이득보다 클 수 있음
if (v.size() < 1000) {
    std::sort(std::execution::seq, v.begin(), v.end());
} else {
    std::sort(std::execution::par, v.begin(), v.end());
}

2. 독립성 확인 후 par_unseq

// 원소 간 독립 + 동기화 없음 → par_unseq
std::transform(std::execution::par_unseq, a.begin(), a.end(), b.begin(),
                { return std::sqrt(x); });

// 공유 상태 접근 → par만
std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
    std::lock_guard<std::mutex> lock(mtx);
    log(x);
});

3. reserve로 재할당 방지

std::vector<double> result;
result.resize(input.size());  // 또는 reserve + back_inserter
std::transform(std::execution::par, input.begin(), input.end(),
               result.begin(), transform_fn);

4. 예외 안전성

// 병렬 알고리즘에서 예외 발생 시 std::terminate 호출 가능
// 람다 내부에서 예외를 잡아 처리하거나, noexcept 보장
std::transform(std::execution::par, a.begin(), a.end(), b.begin(),
                noexcept { return x * 2; });

5. 프로파일링 후 적용

// 추측이 아닌 측정
auto start = std::chrono::high_resolution_clock::now();
std::sort(std::execution::par, v.begin(), v.end());
auto end = std::chrono::high_resolution_clock::now();
// 병목이 확인된 부분만 병렬화

6. 연속 메모리 활용

// ✅ std::vector — 연속 메모리, 캐시 친화적
std::vector<double> data(1'000'000);
std::transform(std::execution::par, data.begin(), data.end(), data.begin(),
               { return std::sqrt(x); });

// ⚠️ std::deque — 연속이 아님, 병렬 알고리즘은 동작하지만 캐시 효율 낮음

7. 이동 시맨틱 활용

// ✅ 이동으로 불필요한 복사 제거
std::vector<BigObject> process_parallel(std::vector<BigObject> input) {
    std::vector<BigObject> result(input.size());
    std::transform(std::execution::par,
                  std::make_move_iterator(input.begin()),
                  std::make_move_iterator(input.end()),
                  result.begin(),
                   { return process(std::move(obj)); });
    return result;
}

8. 조건부 병렬화

#include <algorithm>
#include <execution>
#include <vector>

template <typename It>
void sort_adaptive(It first, It last) {
    constexpr size_t PARALLEL_THRESHOLD = 10'000;
    if (std::distance(first, last) < PARALLEL_THRESHOLD) {
        std::sort(std::execution::seq, first, last);
    } else {
        std::sort(std::execution::par, first, last);
    }
}

6. 성능 벤치마크

벤치마크 예제 (sort)

#include <algorithm>
#include <execution>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>

int main() {
    const size_t N = 10'000'000;
    std::vector<int> v(N);
    std::mt19937 gen(42);
    std::uniform_int_distribution<> dis(0, 1'000'000);
    for (auto& x : v) x = dis(gen);

    auto seq = v;
    auto par = v;

    auto t1 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::seq, seq.begin(), seq.end());
    auto t2 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, par.begin(), par.end());
    auto t3 = std::chrono::high_resolution_clock::now();

    auto ms_seq = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    auto ms_par = std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count();
    std::cout << "seq: " << ms_seq << " ms, par: " << ms_par << " ms, speedup: "
              << (double)ms_seq / ms_par << "x\n";
    return 0;
}

예상 결과 (8코어 CPU, 1000만 int)

알고리즘	seq	par	가속비
sort	~800ms	~150ms	~5x
transform (1M float)	~4ms	~1ms	~4x
reduce (10M double)	~25ms	~6ms	~4x

7. 프로덕션 패턴

패턴 1: 크기 기반 정책 선택

template <typename It>
void sort_adaptive(It first, It last) {
    auto n = std::distance(first, last);
    if (n < 10'000) {
        std::sort(std::execution::seq, first, last);
    } else {
        std::sort(std::execution::par, first, last);
    }
}

패턴 2: 파이프라인 — transform → filter → reduce

// 1. 변환 (병렬)
std::vector<double> transformed(data.size());
std::transform(std::execution::par, data.begin(), data.end(),
               transformed.begin(),  { return std::sqrt(x); });

// 2. 필터 (병렬)
std::vector<double> filtered;
std::copy_if(std::execution::par, transformed.begin(), transformed.end(),
             std::back_inserter(filtered),  { return x > 0; });

// 3. 집계 (병렬)
double sum = std::reduce(std::execution::par, filtered.begin(), filtered.end(), 0.0);

패턴 3: 배치 처리 — 청크 단위 병렬

void process_batch_parallel(const std::vector<Item>& items,
                            std::vector<Result>& results) {
    results.resize(items.size());
    std::transform(std::execution::par,
                   items.begin(), items.end(),
                   results.begin(),
                    { return process(item); });
}

패턴 4: std::execution과 스레드 풀 병행

// 병렬 알고리즘: 데이터 중심, 일괄 처리
std::sort(std::execution::par, v.begin(), v.end());

// 스레드 풀: 작업 중심, 비동기 이벤트
pool.submit([&]() { handle_request(req); });

패턴 5: 예외 처리 래퍼

template <typename Policy, typename It, typename... Args>
auto safe_parallel_sort(Policy policy, It first, It last, Args&&... args) {
    try {
        std::sort(policy, first, last, std::forward<Args>(args)...);
    } catch (const std::exception& e) {
        // 로깅, 폴백 처리
        std::sort(std::execution::seq, first, last, std::forward<Args>(args)...);
    }
}

패턴 6: Map-Reduce 스타일 집계

// 1단계: 각 청크별 부분 결과 (map)
// 2단계: 부분 결과 병합 (reduce)
struct Stats {
    double sum = 0;
    size_t count = 0;
};
Stats aggregate_parallel(const std::vector<double>& v) {
    return std::transform_reduce(
        std::execution::par,
        v.begin(), v.end(),
        Stats{},
         {
            return Stats{a.sum + b.sum, a.count + b.count};
        },
         { return Stats{x, 1}; });
}

패턴 7: 병렬 정렬 + 이진 검색 파이프라인

// 대량 데이터 정렬 후 반복 검색
void build_lookup_table(std::vector<std::pair<int, Data>>& table) {
    std::sort(std::execution::par, table.begin(), table.end(),
               { return a.first < b.first; });
}

Data lookup(const std::vector<std::pair<int, Data>>& table, int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
         { return p.first < k; });
    return (it != table.end() && it->first == key) ? it->second : Data{};
}

패턴 8: 스레드 로컬 버퍼 + 병렬 병합

// 각 스레드가 로컬 버퍼에 쓰고, 마지막에 병합
std::vector<int> process_with_local_buffers(const std::vector<int>& input) {
    thread_local std::vector<int> local;
    local.clear();
    std::mutex mtx;
    std::vector<int> global;

    std::for_each(std::execution::par, input.begin(), input.end(),
        [&](int x) {
            int result = process(x);
            {
                std::lock_guard<std::mutex> lock(mtx);
                global.push_back(result);
            }
        });
    return global;
}

패턴 9: CPU 코어 수 기반 청크 분할

#include <thread>
#include <algorithm>
#include <execution>

// 라이브러리가 자동으로 처리하지만, 수동 제어가 필요할 때
size_t optimal_chunk_size(size_t total) {
    size_t cores = std::thread::hardware_concurrency();
    return std::max(size_t(1), total / (cores * 4));  // 코어당 4청크
}

패턴 10: 병렬 알고리즘 + 스레드 풀 혼합

// 배치 데이터: std::execution::par (데이터 병렬)
void process_batch(std::vector<Item>& items) {
    std::transform(std::execution::par, items.begin(), items.end(),
                  items.begin(), process_item);
}

// 이벤트 기반 작업: 스레드 풀 (작업 병렬)
void handle_requests(ThreadPool& pool, const std::vector<Request>& reqs) {
    for (const auto& req : reqs) {
        pool.submit([req]() { handle_request(req); });
    }
}

8. 정리 및 체크리스트

핵심 요약

항목	내용
std::execution::par	멀티스레드 병렬, 락·공유 변수 수정 시 주의
std::execution::par_unseq	병렬 + SIMD, 동기화 프리 필수
sort	병렬 정렬, 대용량에서 4~8배 가속
transform	병렬 변환, 픽셀·배열 처리 등
reduce	병렬 집계, 부동소수점 순서 비결정 주의

구현 체크리스트

프로파일러로 병목 확인
데이터 레이스 없음 (공유 변수 수정 금지)
par_unseq 사용 시 락·atomic·공유 상태 접근 없음
작은 데이터(<1000)는 seq 유지 검토
MSVC 사용 시 TBB 링크 확인
예외 처리 또는 noexcept 보장
reserve/resize로 출력 버퍼 사전 할당

참고 자료

실무 팁

개발 시 주의사항

[팁 1]: [설명]
```
// 예시 코드
```
[팁 2]: [설명]
```
// 예시 코드
```
[팁 3]: [설명]

디버깅 방법

[방법 1]: [설명]
[방법 2]: [설명]
[방법 3]: [설명]

FAQ

Q. 병렬 알고리즘을 언제 적용해야 하나요?

A. 프로파일에서 std::sort, std::transform, std::reduce 등이 병목일 때, 대량 데이터(수만~수백만 이상)에서 std::execution::par를 적용합니다. 1000개 미만에서는 오버헤드가 이득보다 클 수 있어 seq를 유지하는 것이 좋습니다.

Q. par와 par_unseq 중 뭘 써야 하나요?

A. 먼저 par로 병렬화해 효과를 확인합니다. 람다가 완전히 독립적이고(락·atomic·공유 변수 없음) 추가 가속이 필요하면 par_unseq를 시도합니다. par_unseq 위반 시 UB이므로 주의가 필요합니다.

Q. std::accumulate를 std::reduce로 바꿔도 되나요?

A. 합계·곱·최대값처럼 결합 법칙이 성립하면 reduce로 바꿔도 됩니다. 부동소수점은 accumulate와 미세하게 다른 결과가 나올 수 있으므로, 수치 정확도가 중요한 경우에는 검토가 필요합니다.

Q. Windows에서 링크 에러가 나요.

A. MSVC 병렬 알고리즘은 Intel TBB 또는 Windows SDK의 병렬 런타임을 사용합니다. vcpkg install tbb 후 target_link_libraries(myapp TBB::tbb)를 추가하세요.

한 줄 요약: std::execution::par로 sort·transform·reduce를 한 줄에 병렬화할 수 있습니다. 데이터 레이스와 par_unseq 제약을 지키면 멀티코어를 안전하게 활용할 수 있습니다. 다음으로 데이터베이스 쿼리 최적화 #51-8를 읽어보면 좋습니다.

C++ 스레드 풀 완벽 가이드 | 작업 큐·병렬 처리·성능 벤치마크 [#51-3]
C++ 고급 프로파일링 완벽 가이드 | perf·gprof
C++ Execution Policy |
C++ 알고리즘 |

이 글의 핵심

들어가며: 멀티코어가 있는데 한 코어만 쓰고 있어요

”100만 개 정렬하는데 8코어 중 1개만 100% 사용해요”

목차

1. 문제 시나리오

시나리오 1: 대용량 배열 정렬이 병목일 때

시나리오 2: 이미지 픽셀 변환이 느릴 때

시나리오 3: 대량 데이터 집계가 병목일 때

시나리오 4: ETL 파이프라인에서 변환 단계가 느릴 때

시나리오 5: 스레드 풀 없이 간단히 병렬화하고 싶을 때

시나리오 6: par_unseq로 SIMD까지 활용하고 싶을 때

2. std::execution 정책 완전 가이드

정책 비교

seq vs par

par vs par_unseq

3. 병렬 sort·transform·reduce 완전 예제

예제 1: std::execution::par — 병렬 정렬

예제 2: std::execution::par_unseq — 병렬 + SIMD 변환

예제 3: 병렬 reduce — 합계·내적·최대값

예제 4: 병렬 for_each — 인덱스 기반 처리

예제 5: 병렬 count_if — 조건 만족 개수

예제 6: 병렬 파이프라인 — transform + reduce

지원되는 병렬 알고리즘 목록

예제 7: 병렬 partial_sort — 상위 K개만 정렬

예제 8: 병렬 inclusive_scan — 누적 합 (C++17)

예제 9: 병렬 all_of / any_of / none_of

예제 10: 병렬 fill_n — 대량 초기화

4. 자주 발생하는 에러와 해결법

에러 진단 플로우

에러 1: 데이터 레이스 — 공유 변수 수정

에러 2: par_unseq에서 락 사용

에러 3: 반복자 무효화

에러 4: 람다 캡처로 use-after-free

에러 5: 부동소수점 reduce vs accumulate 차이

에러 6: 빈 범위 또는 단일 원소

에러 7: MSVC에서 execution 헤더 링크 오류

에러 8: 정렬·검색 기준 불일치

에러 9: 커스텀 비교자에서 스레드 안전 위반

에러 10: 반복자 종류 제한

5. 베스트 프랙티스

1. 작은 데이터는 seq 유지

2. 독립성 확인 후 par_unseq

3. reserve로 재할당 방지

4. 예외 안전성

5. 프로파일링 후 적용

6. 연속 메모리 활용

7. 이동 시맨틱 활용

8. 조건부 병렬화

6. 성능 벤치마크

벤치마크 예제 (sort)

예상 결과 (8코어 CPU, 1000만 int)

7. 프로덕션 패턴

패턴 1: 크기 기반 정책 선택

패턴 2: 파이프라인 — transform → filter → reduce

패턴 3: 배치 처리 — 청크 단위 병렬

패턴 4: std::execution과 스레드 풀 병행

패턴 5: 예외 처리 래퍼

패턴 6: Map-Reduce 스타일 집계

패턴 7: 병렬 정렬 + 이진 검색 파이프라인

패턴 8: 스레드 로컬 버퍼 + 병렬 병합

패턴 9: CPU 코어 수 기반 청크 분할

패턴 10: 병렬 알고리즘 + 스레드 풀 혼합

8. 정리 및 체크리스트

핵심 요약

구현 체크리스트

참고 자료

실무 팁

개발 시 주의사항

디버깅 방법

FAQ

관련 글