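When is this used in practice?

When a profiler shows that a loop is the bottleneck in bulk sorting, transformation, or aggregation, image processing, numerical simulation, or an ETL pipeline, that loop is a candidate for std::execution::par or par_unseq.

What should I read first?

SIMD and std::execution (#39-3), Thread Pools (#51-3), and Choosing STL Algorithms (#54-1) make this article easier to follow.

Where can I go deeper?

See the cppreference pages on parallel algorithms and the Intel TBB and Microsoft PPL documentation.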

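The Complete Guide to C++ Parallel Algorithms | std::execution::par and par_unseq

April 5, 2026 · 45 min read · Updated April 7, 2026 · Advanced · Hands-on

Key points

C++17 parallel algorithms: std::execution::par, par_unseq, and parallel std::sort, transform, and reduce. The article covers problem scenarios, complete examples, common errors, best practices, and production patterns, all with working code.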

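Introduction: I have multiple cores but only one is busy

"Sorting one million elements pins one of my eight cores at 100%."

Post #39-3, SIMD and std::execution, covered the basics of std::execution::par. This article focuses on the C++17 parallel algorithms themselves. Adding an execution policy to an existing STL algorithm such as std::sort, std::transform, or std::reduce lets the library spread the work across cores, so you do not have to write a thread pool yourself. An analogy: a parallel algorithm hands the work out to several people. If eight people are available and only one is working, seven sit idle; std::execution::par lets all eight work at once.

After reading this article you should be able to:

Explain the difference between std::execution::par and par_unseq.
Implement parallel sort, transform, and reduce from complete examples.
Avoid common errors such as data races and par_unseq violations.
Apply patterns that have held up in production.

Requirements: C++17 or later and the <execution> header. MSVC and GCC (libstdc++, which uses Intel TBB as its backend) ship the parallel overloads; Clang's libc++ support has historically been incomplete, so check your toolchain.

From practice: this article is based on problems encountered and solved in large C++ projects, and includes pitfalls and debugging tips that books and reference documentation tend to skip.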

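1. Problem scenarios

Scenario 1: Sorting a large array is the bottleneck

"Sorting 10 million ints takes 3 seconds."
"The profiler attributes 80% of the time to std::sort."
"I have 8 cores but only one is at 100%."

Situation: std::sort(v.begin(), v.end()) runs sequentially. With large inputs, one thread is saturated while the remaining cores sit idle. Fix: std::sort(std::execution::par, v.begin(), v.end()) allows the library to split the range across threads. On an 8-core machine with 10 million or more elements, a speedup of several times is typical (section 6 shows about 5x), though the exact figure depends on the hardware and the standard library implementation.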

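Scenario 2: Per-pixel image transformation is slow

"Applying a filter to a 1920×1080 image takes 50 ms."
"The target is 60 fps, but this one loop holds us at 30 fps."

Situation: a transformation such as out[i] = gamma_correct(in[i]) is applied to every pixel. Roughly 2 million pixels in a sequential loop keep a single core busy. Fix: std::transform(std::execution::par, in.begin(), in.end(), out.begin(), gamma_correct) lets the library divide the range among the available cores.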

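Scenario 3: Bulk aggregation is the bottleneck

"Summing 100 million doubles takes 500 ms."
"std::accumulate is sequential, so it uses one core."

Situation: std::accumulate guarantees left-to-right evaluation, which is exactly why it cannot be parallelized. When the operation is associative and commutative, as with integer sum, product, or maximum, regrouping and reordering the operands does not change the result. Fix: std::reduce(std::execution::par, v.begin(), v.end(), 0.0) computes partial sums in parallel and then combines them. Floating-point addition is not exactly associative, so the result can differ slightly from accumulate; for large data sets that is often acceptable.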
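
To make the floating-point caveat concrete, here is a minimal sketch that adds the same three doubles with two groupings. The function names and the values 1e17, -1e17, and 1.0 are illustrative assumptions, not taken from the scenario.

```cpp
// Adds three doubles with two different groupings.
double left_grouping(double a, double b, double c)  { return (a + b) + c; }
double right_grouping(double a, double b, double c) { return a + (b + c); }

// Illustrative values: near 1e17 adjacent doubles are 16 apart, so adding 1.0
// to -1e17 rounds straight back to -1e17 and the 1.0 is lost.
inline bool grouping_changes_result() {
    const double a = 1e17, b = -1e17, c = 1.0;
    return left_grouping(a, b, c) != right_grouping(a, b, c);
}
```

A parallel reduce is free to pick either grouping, which is why its floating-point result may not match a strict left-to-right sum.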

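Scenario 4: The transform stage of an ETL pipeline is slow

"Reading 1 million rows from the database, then transforming and filtering them, takes 10 seconds."
"The transformation logic is simple, but it runs sequentially and leaves cores unused."

Situation: a sequential std::transform + std::copy_if combination uses a single core once the I/O wait is over. Fix: parallelize the transform stage with std::transform(std::execution::par, ...) and, where possible, fuse transformation and aggregation into a single std::transform_reduce pass.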

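Scenario 5: Simple parallelism without a thread pool

"Calling std::async per iteration creates far too many futures."
"A thread pool is complicated to build, and I only need simple parallelism."

Situation: calling std::async inside a loop creates one future per task and, with std::launch::async, potentially one thread per task. A hand-written thread pool needs a task queue, workers, and shutdown handling. Fix: with std::for_each(std::execution::par, ...) or std::transform(std::execution::par, ...), scheduling is left to the library (typically an internal thread pool, depending on the implementation). Parallelizing becomes a one-argument change.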

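Scenario 6: Adding SIMD with par_unseq

"par made it 4x faster. Would SIMD make it faster still?"
"The lambda is a pure function, so vectorizing should be safe."

Situation: `std::execution::par` only permits multithreaded execution. `par_unseq` additionally permits vectorization (SIMD), so a single thread may process several elements, for example 4 to 8, per instruction. Fix: use `par_unseq` only when the lambda is synchronization-free (no locks, no atomics, no writes to shared variables). Violating this is undefined behavior.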

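2. A complete guide to std::execution policies

Policy comparison

flowchart TB
    subgraph seq["seq (sequential)"]
        S1[element 1] --> S2[element 2] --> S3[element 3] --> S4[...]
    end
    subgraph par["par (parallel)"]
        P1[thread 1: range A]
        P2[thread 2: range B]
        P3[thread 3: range C]
        P4[thread 4: range D]
    end
    subgraph par_unseq["par_unseq (parallel + SIMD)"]
        U1["thread 1: 8 at a time via SIMD"]
        U2["thread 2: 8 at a time via SIMD"]
    end

Policy	Description	Multithreaded	SIMD	Requirement
seq	Sequential execution	❌	❌	None
par	Multithreaded parallel	✅	❌	Iterators and functions must be thread-safe
par_unseq	Parallel + vectorized	✅	✅	Synchronization-free (no locks or atomics)
unseq (C++20)	Vectorized on a single thread	❌	✅	Synchronization-free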
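
Because each policy is a distinct type, a function can accept the policy as a template parameter and let the caller choose. The following is a minimal sketch under that assumption; sum_with is a hypothetical helper name, and std::is_execution_policy_v is the standard trait from <execution>.

```cpp
#include <execution>
#include <numeric>
#include <type_traits>
#include <utility>
#include <vector>

// Sums a vector under whichever standard execution policy the caller passes.
// sum_with is a hypothetical helper name used only for this sketch.
template <typename Policy>
long long sum_with(Policy&& policy, const std::vector<int>& v) {
    static_assert(std::is_execution_policy_v<std::decay_t<Policy>>,
                  "first argument must be an execution policy");
    return std::reduce(std::forward<Policy>(policy), v.begin(), v.end(), 0LL);
}
```

A call such as sum_with(std::execution::par, v) then differs from the sequential version only in its first argument, which keeps policy selection out of the algorithm code.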

seq vs par

// seq: sequential execution (same ordering as the overload without a policy,
// but an exception escaping a callable calls std::terminate)
#include <algorithm>
#include <execution>
#include <vector>
void sort_sequential(std::vector<int>& v) {
    std::sort(std::execution::seq, v.begin(), v.end());
}
// par: multithreaded parallel execution
void sort_parallel(std::vector<int>& v) {
    std::sort(std::execution::par, v.begin(), v.end());
}

par vs par_unseq

#include <mutex>
// par: locks are allowed (the callable only has to be thread-safe)
// shared_result, counter, and process() are assumed to be defined elsewhere
std::mutex mtx;
std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
    std::lock_guard<std::mutex> lock(mtx);
    shared_result += process(x);
});
// par_unseq: no locks, atomics, or writes to shared variables; elements must be fully independent
std::transform(std::execution::par_unseq, a.begin(), a.end(), b.begin(),
               c.begin(), [](double x, double y) { return x * y + 1.0; });

A par_unseq violation:

// ❌ UB: taking a lock under par_unseq
std::mutex m;
std::for_each(std::execution::par_unseq, v.begin(), v.end(), [&](int x) {
    std::lock_guard<std::mutex> lock(m);  // undefined behavior!
    counter++;
});
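
One way to avoid the lock altogether is to express the accumulation as a reduction, which needs no shared variable. The sketch below assumes the per-element work is pure; process here is a hypothetical stand-in (it squares its argument) and sum_processed is an illustrative name.

```cpp
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the per-element work: pure and lock-free.
inline int process(int x) { return x * x; }

// Accumulates process(x) over all elements without a mutex or shared counter.
// Each call touches only its own argument, so par_unseq is permitted.
int sum_processed(const std::vector<int>& v) {
    return std::transform_reduce(std::execution::par_unseq,
                                 v.begin(), v.end(), 0,
                                 std::plus<>{},
                                 [](int x) { return process(x); });
}
```

The library combines the partial results itself, so the only requirement left on the caller is that the per-element function stays free of synchronization.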

3. Complete examples: parallel sort, transform, and reduce

Example 1: std::execution::par for a parallel sort

#include <algorithm>
#include <execution>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>
int main() {
    std::vector<int> v(10'000'000);
    std::mt19937 gen(42);
    std::uniform_int_distribution<> dis(0, 1'000'000);
    for (auto& x : v) x = dis(gen);
    auto start = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, v.begin(), v.end());
    auto end = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "Parallel sort: " << ms << " ms\n";
    return 0;
}

Compile (Windows, MSVC):

cl /EHsc /std:c++17 /O2 /MD parallel_sort.cpp

Compile (Linux, GCC; libstdc++ parallel algorithms need Intel TBB at link time):

g++ -std=c++17 -O3 -pthread -o parallel_sort parallel_sort.cpp -ltbb

Example 2: std::execution::par_unseq for a parallel + SIMD transform

#include <algorithm>
#include <execution>
#include <vector>
#include <cmath>
// Gamma correction: out = pow(in, gamma)
void gamma_correct_parallel(const std::vector<float>& in,
                            std::vector<float>& out,
                            float gamma) {
    out.resize(in.size());
    std::transform(std::execution::par_unseq,
                   in.begin(), in.end(),
                   out.begin(),
                   [gamma](float x) { return std::pow(x, gamma); });
}
// Vector addition (fully independent elements, so par_unseq fits)
// Precondition: b has at least as many elements as a
void add_vectors_par_unseq(const std::vector<double>& a,
                            const std::vector<double>& b,
                            std::vector<double>& c) {
    c.resize(a.size());
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), c.begin(),
                   [](double x, double y) { return x + y; });
}

Example 3: Parallel reduce for sum, dot product, and maximum

#include <numeric>
#include <execution>
#include <vector>
#include <algorithm>
#include <functional>
#include <limits>
// Parallel sum
double sum_parallel(const std::vector<double>& v) {
    return std::reduce(std::execution::par, v.begin(), v.end(), 0.0);
}
// Parallel dot product
double dot_product_parallel(const std::vector<double>& a,
                            const std::vector<double>& b) {
    return std::transform_reduce(
        std::execution::par,
        a.begin(), a.end(), b.begin(), 0.0,
        std::plus<>(), std::multiplies<>());
}
// Parallel maximum (returns INT_MIN for an empty vector)
int max_parallel(const std::vector<int>& v) {
    return std::reduce(std::execution::par, v.begin(), v.end(),
                       std::numeric_limits<int>::min(),
                       [](int a, int b) { return std::max(a, b); });
}
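
A reduce seeded with std::numeric_limits<int>::min() cannot distinguish an empty input from one whose maximum really is INT_MIN. As an alternative sketch, std::max_element also has a parallel overload and reports the empty case through its returned iterator; max_or_empty is a hypothetical helper name.

```cpp
#include <algorithm>
#include <execution>
#include <optional>
#include <vector>

// Returns the largest element, or std::nullopt for an empty vector.
// max_or_empty is a hypothetical helper name for this sketch.
std::optional<int> max_or_empty(const std::vector<int>& v) {
    auto it = std::max_element(std::execution::par, v.begin(), v.end());
    if (it == v.end()) return std::nullopt;
    return *it;
}
```

Which form is preferable depends on whether callers need to detect empty input; the reduce version stays useful when the maximum is one part of a larger combined reduction.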

Example 4: Parallel for_each with index-based processing

#include <algorithm>
#include <execution>
#include <vector>
#include <numeric>
void process_chunks_parallel(std::vector<int>& data) {
    std::vector<size_t> indices(data.size());
    std::iota(indices.begin(), indices.end(), 0);
    std::for_each(std::execution::par, indices.begin(), indices.end(),
                  [&data](size_t i) {
                      data[i] = data[i] * 2 + 1;  // independent per index
                  });
}

Example 5: Parallel count_if to count matching elements

#include <algorithm>
#include <execution>
#include <vector>
size_t count_positive_parallel(const std::vector<double>& v) {
    return std::count_if(std::execution::par,
                         v.begin(), v.end(),
                         [](double x) { return x > 0; });
}

Example 6: Parallel pipeline with transform + reduce

#include <numeric>
#include <execution>
#include <vector>
#include <cmath>
// Square each element, then sum (parallel)
double sum_of_squares_parallel(const std::vector<double>& v) {
    return std::transform_reduce(
        std::execution::par,
        v.begin(), v.end(), 0.0,
        std::plus<>(),
        [](double x) { return x * x; });
}
// L2 norm: sqrt(sum(x^2))
double l2_norm_parallel(const std::vector<double>& v) {
    double sum_sq = sum_of_squares_parallel(v);
    return std::sqrt(sum_sq);
}

Supported parallel algorithms (selection)

Algorithm	par	par_unseq
std::sort	✅	✅ (comparator and swap must be synchronization-free)
std::transform	✅	✅
std::reduce	✅	✅
std::transform_reduce	✅	✅
std::for_each	✅	✅
std::count_if	✅	✅
std::find_if	✅	✅
std::copy_if	✅	✅
std::fill	✅	✅
std::generate	✅	✅

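Example 7: Parallel partial_sort, sorting only the first K

#include <algorithm>
#include <execution>
#include <vector>
// Sorts only the k smallest elements into v[0..k), which is cheaper than a full sort.
// Pass std::greater<>{} as a comparator to get the k largest instead.
void top_k_parallel(std::vector<int>& v, size_t k) {
    k = std::min(k, v.size());  // guard against k > v.size()
    std::partial_sort(std::execution::par,
                      v.begin(), v.begin() + k, v.end());
}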

Example 8: Parallel inclusive_scan for prefix sums (C++17)

#include <numeric>
#include <execution>
#include <vector>
void prefix_sum_parallel(std::vector<int>& v) {
    std::inclusive_scan(std::execution::par, v.begin(), v.end(), v.begin());
}
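
As a small worked example of what the scan produces, the sketch below writes to a separate output vector and contrasts inclusive_scan with exclusive_scan. The function names running_totals and start_offsets are hypothetical.

```cpp
#include <execution>
#include <numeric>
#include <vector>

// Running totals: out[i] = v[0] + ... + v[i].
std::vector<int> running_totals(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::inclusive_scan(std::execution::par, v.begin(), v.end(), out.begin());
    return out;
}

// Start offsets: out[i] = v[0] + ... + v[i-1], with out[0] = 0.
std::vector<int> start_offsets(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    std::exclusive_scan(std::execution::par, v.begin(), v.end(), out.begin(), 0);
    return out;
}
```

For the input {1, 2, 3, 4}, running_totals yields {1, 3, 6, 10} and start_offsets yields {0, 1, 3, 6}. The exclusive form is the one typically used to turn per-item sizes into write positions.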

Example 9: Parallel all_of / any_of / none_of

#include <algorithm>
#include <execution>
#include <vector>
bool all_positive_parallel(const std::vector<double>& v) {
    return std::all_of(std::execution::par,
                       v.begin(), v.end(),
                       [](double x) { return x > 0; });
}
bool any_negative_parallel(const std::vector<double>& v) {
    return std::any_of(std::execution::par,
                       v.begin(), v.end(),
                       [](double x) { return x < 0; });
}

Example 10: Parallel fill for bulk initialization

#include <algorithm>
#include <execution>
#include <vector>
void init_parallel(std::vector<int>& v, int value) {
    std::fill(std::execution::par, v.begin(), v.end(), value);
}

4. Common errors and how to fix them

Error diagnosis flow

flowchart TD
    A[Error occurs] --> B{Data race?}
    B -->|Yes| C[Use reduce/transform_reduce]
    B -->|No| D{Using par_unseq?}
    D -->|Yes| E{Locks/atomics/shared variables?}
    E -->|Yes| F[Switch to par]
    E -->|No| G[Keep it]
    D -->|No| H{Iterator invalidation?}
    H -->|Yes| I[Use a separate output buffer]
    H -->|No| J[Review captures and exceptions]

Error 1: Data race from writing to a shared variable

Symptoms: intermittent crashes, wrong results, ThreadSanitizer warnings.

// ❌ Wrong: parallel writes to a shared variable
int sum = 0;
std::for_each(std::execution::par, v.begin(), v.end(), [&sum](int x) {
    sum += x;  // data race!
});

Fix:

// ✅ Option A: reduce (independent per element; partial sums are merged)
int sum = std::reduce(std::execution::par, v.begin(), v.end(), 0);
// Option B: atomic (slower; use only when really needed, and only with par)
std::atomic<int> atomic_sum{0};
std::for_each(std::execution::par, v.begin(), v.end(), [&atomic_sum](int x) {
    atomic_sum.fetch_add(x);  // high synchronization overhead
});

Error 2: Taking a lock under par_unseq

Symptoms: deadlock, undefined behavior, intermittent crashes.

// ❌ mutex under par_unseq: UB
std::mutex mtx;
std::for_each(std::execution::par_unseq, v.begin(), v.end(), [&mtx](int x) {
    std::lock_guard<std::mutex> lock(mtx);
    process(x);
});

Fix:

// ✅ Use par, which allows locks
std::for_each(std::execution::par, v.begin(), v.end(), [&mtx](int x) {
    std::lock_guard<std::mutex> lock(mtx);
    process(x);
});
// Or drop the lock: write each element's result to its own slot
// (process itself must also be synchronization-free)
std::vector<int> results(v.size());
std::transform(std::execution::par_unseq, v.begin(), v.end(), results.begin(),
               [](int x) { return process(x); });

Error 3: Iterator invalidation

Symptoms: crashes, wrong results.

// ❌ Modifying the container while it is being processed in parallel
std::for_each(std::execution::par, v.begin(), v.end(), [&v](int x) {
    if (x > 0) v.push_back(x);  // invalidates iterators (and is a data race)
});

Fix:

// ✅ Option A: write to a separate container under a lock (output order is not deterministic)
std::vector<int> result;
result.reserve(v.size());
std::mutex mtx;
std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
    if (x > 0) {
        std::lock_guard<std::mutex> lock(mtx);
        result.push_back(x);
    }
});
// ✅ Option B: copy_if with par into a pre-sized buffer (keeps input order)
std::vector<int> filtered(v.size());
auto last = std::copy_if(std::execution::par, v.begin(), v.end(), filtered.begin(),
                         [](int x) { return x > 0; });
filtered.erase(last, filtered.end());

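Error 4: Use-after-free through lambda captures

Symptoms: crashes, garbage values. A parallel algorithm call blocks until all of its work has finished, so capturing a local by reference for the duration of the call is safe. The danger appears when the call runs inside a background task that can outlive the data it references.

// ❌ The background task holds a reference that may dangle
std::future<void> process_async(const std::vector<int>& data) {
    return std::async(std::launch::async, [&data] {
        std::for_each(std::execution::par, data.begin(), data.end(),
                      [&data](int x) { use(data, x); });
    });  // the caller may destroy data before the task finishes
}

Fix:

// ✅ Move (or copy) the data into the task so the task owns it
std::future<void> process_async(std::vector<int> data) {
    return std::async(std::launch::async, [data = std::move(data)] {
        std::for_each(std::execution::par, data.begin(), data.end(),
                      [&data](int x) { use(data, x); });
    });
}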

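Error 5: Floating-point differences between reduce and accumulate

Symptom: std::reduce returns a slightly different value than std::accumulate.

// accumulate: fixed order, ((a + b) + c) + d
// reduce: merges partial sums, e.g. (a + b) + (c + d); the grouping is unspecified
std::vector<float> v(1000000, 0.1f);
float acc = std::accumulate(v.begin(), v.end(), 0.0f);
float red = std::reduce(std::execution::par, v.begin(), v.end(), 0.0f);
// acc and red may differ because rounding error accumulates differently

Fix: if evaluation order matters, use accumulate (sequential). If throughput on large data matters and small deviations are acceptable, use reduce (parallel). If you need compensated (Kahan) summation, implement it separately.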

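Error 6: Empty range or single element

Symptom: exceptions or abnormal behavior reported on some implementations. A conforming implementation must handle both cases.

// Empty vector
std::vector<int> empty;
std::sort(std::execution::par, empty.begin(), empty.end());  // OK (no-op)
// Single element: no benefit from parallelism, but safe
std::vector<int> single = {42};
std::sort(std::execution::par, single.begin(), single.end());  // OK

Fix: the standard permits empty ranges, so nothing is required for correctness. Implementations may fall back to sequential execution for small sizes; you can also apply par selectively, only above a threshold (for example, 1000 elements or more).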

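Error 7: Link errors with the <execution> header

Symptom: unresolved external symbols or undefined references (for example to tbb:: symbols) when building code that uses parallel algorithms. Cause: GCC's libstdc++ implements the parallel algorithms on top of Intel TBB, so TBB must be installed and linked. MSVC's standard library ships its own implementation and normally needs no extra library. Fix: install TBB, for example with vcpkg:

vcpkg install tbb

Then link it in the project:

# CMake
find_package(TBB REQUIRED)
target_link_libraries(myapp PRIVATE TBB::tbb)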

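Error 8: Sort key and search key do not match

Symptom: std::lower_bound and similar searches return unexpected results.

// ❌ Sorted by name, searched by id: UB
std::sort(std::execution::par, users.begin(), users.end(),
          [](const auto& a, const auto& b) { return a.name < b.name; });
auto it = std::lower_bound(users.begin(), users.end(), target_id,
    [](const auto& u, const auto& id) { return u.id < id; });

Fix: the sort criterion and the search criterion must be the same.

// ✅ Sort by id, then search by id
std::sort(std::execution::par, users.begin(), users.end(),
          [](const auto& a, const auto& b) { return a.id < b.id; });
auto it = std::lower_bound(users.begin(), users.end(), target_id,
    [](const auto& u, const auto& id) { return u.id < id; });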

Error 9: Thread-safety violation in a custom comparator

Symptoms: intermittent crashes or wrong results during a parallel sort.

// ❌ The comparator modifies captured state
int counter = 0;
std::sort(std::execution::par, v.begin(), v.end(),
    [&counter](int a, int b) {
        counter++;  // data race!
        return a < b;
    });

Fix: comparators and function objects must not mutate state. Use read-only captures only.

// ✅ Pure function
std::sort(std::execution::par, v.begin(), v.end(),
    [](int a, int b) { return a < b; });

Error 10: Iterator category restrictions

Symptom: std::sort(std::execution::par, list.begin(), list.end()) fails to compile. Cause: std::sort requires random-access iterators, and std::list provides only bidirectional iterators. Fix: for std::list, use the list.sort() member function. If you need a parallel sort, copy into a std::vector, sort, and copy back, or use a different data structure.

// std::list → copy to a vector, then sort in parallel
std::list<int> lst = {3, 1, 4, 1, 5};
std::vector<int> vec(lst.begin(), lst.end());
std::sort(std::execution::par, vec.begin(), vec.end());
lst.assign(vec.begin(), vec.end());

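5. Best practices

1. Keep seq for small data

// Below about 1000 elements the parallel overhead can outweigh the gain
// (the threshold is a rule of thumb; measure on your own workload)
if (v.size() < 1000) {
    std::sort(std::execution::seq, v.begin(), v.end());
} else {
    std::sort(std::execution::par, v.begin(), v.end());
}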

2. Use par_unseq only after confirming independence

// Independent elements + no synchronization → par_unseq
std::transform(std::execution::par_unseq, a.begin(), a.end(), b.begin(),
               [](double x) { return std::sqrt(x); });
// Access to shared state → par only
std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
    std::lock_guard<std::mutex> lock(mtx);
    log(x);
});

3. Pre-size the output to avoid reallocation

std::vector<double> result;
result.resize(input.size());  // resize, not reserve + back_inserter: parallel overloads need forward iterators
std::transform(std::execution::par, input.begin(), input.end(),
               result.begin(), transform_fn);

4. Exception safety

// If an exception escapes a callable under seq, par, or par_unseq, std::terminate is called
// Catch exceptions inside the lambda, or guarantee noexcept
std::transform(std::execution::par, a.begin(), a.end(), b.begin(),
               [](double x) noexcept { return x * 2; });
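
When the per-element work can fail, one option is to catch inside the lambda and encode the failure in the result. The sketch below assumes string-to-int parsing as the fallible step; parse_all is a hypothetical helper name.

```cpp
#include <algorithm>
#include <execution>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Parses each string to int in parallel. Failures become std::nullopt instead
// of escaping the lambda, which would call std::terminate under par.
// parse_all is a hypothetical helper name for this sketch.
std::vector<std::optional<int>> parse_all(const std::vector<std::string>& in) {
    std::vector<std::optional<int>> out(in.size());
    std::transform(std::execution::par, in.begin(), in.end(), out.begin(),
                   [](const std::string& s) noexcept -> std::optional<int> {
                       try {
                           return std::stoi(s);
                       } catch (const std::exception&) {
                           return std::nullopt;  // invalid_argument or out_of_range
                       }
                   });
    return out;
}
```

The caller can then inspect the output sequentially and decide how to report or retry the failed entries, without any error state being shared between threads.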

5. Profile before applying

// Measure, don't guess
auto start = std::chrono::high_resolution_clock::now();
std::sort(std::execution::par, v.begin(), v.end());
auto end = std::chrono::high_resolution_clock::now();
// Parallelize only the confirmed bottlenecks

6. Prefer contiguous memory

// ✅ std::vector: contiguous memory, cache-friendly
std::vector<double> data(1'000'000);
std::transform(std::execution::par, data.begin(), data.end(), data.begin(),
               [](double x) { return std::sqrt(x); });
// ⚠️ std::deque: not contiguous; parallel algorithms work but cache efficiency is lower

7. Use move semantics

// ✅ Move elements through the transform to avoid unnecessary copies
// (BigObject must be default-constructible for result(input.size()))
std::vector<BigObject> process_parallel(std::vector<BigObject> input) {
    std::vector<BigObject> result(input.size());
    std::transform(std::execution::par,
                  std::make_move_iterator(input.begin()),
                  std::make_move_iterator(input.end()),
                  result.begin(),
                  [](BigObject&& obj) { return process(std::move(obj)); });
    return result;
}

8. Conditional parallelism

#include <algorithm>
#include <cstddef>
#include <execution>
#include <iterator>
#include <vector>
template <typename It>
void sort_adaptive(It first, It last) {
    // Signed type, so the comparison with std::distance has no sign mismatch
    constexpr std::ptrdiff_t PARALLEL_THRESHOLD = 10'000;
    if (std::distance(first, last) < PARALLEL_THRESHOLD) {
        std::sort(std::execution::seq, first, last);
    } else {
        std::sort(std::execution::par, first, last);
    }
}

6. Performance benchmark

Benchmark example (sort)

#include <algorithm>
#include <execution>
#include <vector>
#include <random>
#include <chrono>
#include <iostream>
int main() {
    const size_t N = 10'000'000;
    std::vector<int> v(N);
    std::mt19937 gen(42);
    std::uniform_int_distribution<> dis(0, 1'000'000);
    for (auto& x : v) x = dis(gen);
    auto seq = v;
    auto par = v;
    auto t1 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::seq, seq.begin(), seq.end());
    auto t2 = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, par.begin(), par.end());
    auto t3 = std::chrono::high_resolution_clock::now();
    auto ms_seq = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    auto ms_par = std::chrono::duration_cast<std::chrono::milliseconds>(t3 - t2).count();
    std::cout << "seq: " << ms_seq << " ms, par: " << ms_par << " ms, speedup: "
              << (double)ms_seq / ms_par << "x\n";
    return 0;
}

Expected results (8-core CPU; illustrative figures, measure on your own hardware)

Algorithm	seq	par	Speedup
sort (10M int)	~800 ms	~150 ms	~5x
transform (1M float)	~4 ms	~1 ms	~4x
reduce (10M double)	~25 ms	~6 ms	~4x

7. Production patterns

Pattern 1: Choose the policy by size

template <typename It>
void sort_adaptive(It first, It last) {
    auto n = std::distance(first, last);
    if (n < 10'000) {
        std::sort(std::execution::seq, first, last);
    } else {
        std::sort(std::execution::par, first, last);
    }
}

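Pattern 2: Pipeline (transform → filter → reduce)

// 1. Transform (parallel)
std::vector<double> transformed(data.size());
std::transform(std::execution::par, data.begin(), data.end(),
               transformed.begin(), [](double x) { return std::sqrt(x); });
// 2. Filter (parallel). Parallel overloads need forward iterators, so write
//    into a pre-sized buffer instead of std::back_inserter, then trim.
std::vector<double> filtered(transformed.size());
auto last = std::copy_if(std::execution::par, transformed.begin(), transformed.end(),
                         filtered.begin(), [](double x) { return x > 0; });
filtered.erase(last, filtered.end());
// 3. Aggregate (parallel)
double sum = std::reduce(std::execution::par, filtered.begin(), filtered.end(), 0.0);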
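
When only the final sum is needed, the three stages can be fused into one pass with std::transform_reduce, which avoids both temporary vectors. The following is a sketch of that idea; sum_sqrt_positive is a hypothetical name, and treating filtered-out elements as 0.0 is valid only because the aggregation is a sum.

```cpp
#include <cmath>
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Sums sqrt(x) over the positive elements in one pass, with no temporaries.
// Non-positive inputs contribute 0.0, which mirrors filtering them out of a sum.
double sum_sqrt_positive(const std::vector<double>& data) {
    return std::transform_reduce(
        std::execution::par, data.begin(), data.end(), 0.0, std::plus<>{},
        [](double x) { return x > 0.0 ? std::sqrt(x) : 0.0; });
}
```

The multi-stage version remains the right choice when the intermediate vectors are needed later or when the filter cannot be expressed as a neutral element of the reduction.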

Pattern 3: Batch processing, parallel over items

void process_batch_parallel(const std::vector<Item>& items,
                            std::vector<Result>& results) {
    results.resize(items.size());
    std::transform(std::execution::par,
                   items.begin(), items.end(),
                   results.begin(),
                   [](const Item& item) { return process(item); });
}

Pattern 4: Using std::execution alongside a thread pool

// Parallel algorithms: data-centric, bulk processing
std::sort(std::execution::par, v.begin(), v.end());
// Thread pool: task-centric, asynchronous events
pool.submit([&]() { handle_request(req); });

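Pattern 5: Exception-handling wrapper

// Exceptions thrown by comparators or element functions call std::terminate
// and cannot be caught here. What can be caught is std::bad_alloc, thrown when
// the algorithm cannot acquire the resources it needs for parallel execution.
template <typename Policy, typename It, typename... Args>
void safe_parallel_sort(Policy&& policy, It first, It last, const Args&... args) {
    try {
        std::sort(std::forward<Policy>(policy), first, last, args...);
    } catch (const std::bad_alloc&) {
        // Log, then fall back to a sequential sort
        std::sort(std::execution::seq, first, last, args...);
    }
}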

Pattern 6: Map-reduce style aggregation

// Stage 1: map each element to a partial result
// Stage 2: merge the partial results (reduce)
struct Stats {
    double sum = 0;
    size_t count = 0;
};
Stats aggregate_parallel(const std::vector<double>& v) {
    return std::transform_reduce(
        std::execution::par,
        v.begin(), v.end(),
        Stats{},
        [](const Stats& a, const Stats& b) {
            return Stats{a.sum + b.sum, a.count + b.count};
        },
        [](double x) { return Stats{x, 1}; });
}

Pattern 7: Parallel sort + binary search pipeline

// Sort a large data set once, then search it repeatedly
void build_lookup_table(std::vector<std::pair<int, Data>>& table) {
    std::sort(std::execution::par, table.begin(), table.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
}
Data lookup(const std::vector<std::pair<int, Data>>& table, int key) {
    auto it = std::lower_bound(table.begin(), table.end(), key,
        [](const auto& p, int k) { return p.first < k; });
    return (it != table.end() && it->first == key) ? it->second : Data{};
}

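Pattern 8: Per-chunk local buffers + merge

#include <algorithm>
#include <execution>
#include <numeric>
#include <thread>
#include <vector>
// Each chunk writes to its own buffer, so no lock is needed; the buffers are
// concatenated at the end. thread_local is not required: one chunk index owns
// exactly one buffer.
std::vector<int> process_with_local_buffers(const std::vector<int>& input) {
    const size_t chunks = std::max(1u, std::thread::hardware_concurrency()) * 4;
    const size_t step = (input.size() + chunks - 1) / chunks;
    std::vector<std::vector<int>> buffers(chunks);
    std::vector<size_t> ids(chunks);
    std::iota(ids.begin(), ids.end(), size_t{0});
    std::for_each(std::execution::par, ids.begin(), ids.end(), [&](size_t c) {
        const size_t lo = std::min(c * step, input.size());
        const size_t hi = std::min(lo + step, input.size());
        for (size_t i = lo; i < hi; ++i) buffers[c].push_back(process(input[i]));
    });
    std::vector<int> global;
    global.reserve(input.size());
    for (const auto& b : buffers) global.insert(global.end(), b.begin(), b.end());
    return global;  // results stay in input order
}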

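Pattern 9: Chunk size based on the number of CPU cores

#include <thread>
#include <algorithm>
#include <execution>
// The library handles this automatically; use this when you need manual control
size_t optimal_chunk_size(size_t total) {
    // hardware_concurrency() may return 0 when the value cannot be determined
    size_t cores = std::max(1u, std::thread::hardware_concurrency());
    return std::max(size_t(1), total / (cores * 4));  // 4 chunks per core
}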
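
A chunk size on its own still has to be turned into index ranges. The sketch below shows one way to do that; chunk_ranges is a hypothetical helper, it mirrors the 4-chunks-per-core heuristic of optimal_chunk_size, and it treats a reported core count of 0 as 1.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Splits [0, total) into half-open [begin, end) ranges of about `chunk` elements.
// chunk_ranges is a hypothetical helper; a reported core count of 0 is treated as 1.
std::vector<std::pair<std::size_t, std::size_t>> chunk_ranges(std::size_t total) {
    const std::size_t cores =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t chunk = std::max<std::size_t>(1, total / (cores * 4));
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < total; begin += chunk) {
        ranges.emplace_back(begin, std::min(begin + chunk, total));
    }
    return ranges;
}
```

The resulting ranges are contiguous and cover the whole input, so they can be handed to std::for_each with std::execution::par, one range per task.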

Pattern 10: Mixing parallel algorithms with a thread pool

// Batch data: std::execution::par (data parallelism)
void process_batch(std::vector<Item>& items) {
    std::transform(std::execution::par, items.begin(), items.end(),
                  items.begin(), process_item);
}
// Event-driven work: thread pool (task parallelism)
void handle_requests(ThreadPool& pool, const std::vector<Request>& reqs) {
    for (const auto& req : reqs) {
        pool.submit([req]() { handle_request(req); });
    }
}

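8. Summary and checklist

Key takeaways

Item	Description
std::execution::par	Multithreaded parallel execution; take care with locks and writes to shared variables
std::execution::par_unseq	Parallel + SIMD; callables must be synchronization-free
sort	Parallel sort; several-fold speedup on large inputs
transform	Parallel transformation for pixels, arrays, and similar data
reduce	Parallel aggregation; floating-point grouping is not deterministic

Implementation checklist

Bottleneck confirmed with a profiler
No data races (no unsynchronized writes to shared variables)
With par_unseq: no locks, atomics, or shared-state access
Consider keeping seq for small data (< 1000 elements)
TBB linked when building with GCC (libstdc++)
Exceptions handled inside callables, or noexcept guaranteed
Output buffers pre-sized with resize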

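FAQ

Q. When should I apply parallel algorithms?
A. When profiling shows that `std::sort`, `std::transform`, `std::reduce`, or a similar call is the bottleneck and the data is large (tens of thousands to millions of elements or more), apply `std::execution::par`. Below about 1000 elements the overhead can outweigh the gain, so seq is usually the better choice.

Q. Should I use par or par_unseq?
A. Start with `par` and confirm that it helps. If the lambda is fully independent (no locks, atomics, or shared variables) and you need more speed, try `par_unseq`. Violating the `par_unseq` requirements is UB, so take care.

Q. Can I replace std::accumulate with std::reduce?
A. Yes, when the operation is associative and commutative, as with sum, product, or maximum. Floating-point results can differ slightly from `accumulate`, so review the change where numerical accuracy matters.

Q. I get link errors on Windows.
A. MSVC's standard library ships its own implementation of the parallel algorithms and normally needs no extra library. If the unresolved symbols belong to TBB, for example when building with GCC, run `vcpkg install tbb` and add `target_link_libraries(myapp TBB::tbb)`.

One-line summary: `std::execution::par` parallelizes sort, transform, and reduce with a one-argument change. As long as you avoid data races and respect the par_unseq constraints, you can use multiple cores safely. A good next read is Database Query Optimization (#51-8).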

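Advanced appendix: implementation and operations

This appendix condenses the topic of the main article (C++ parallel algorithms, std::execution::par and par_unseq) from an implementation, runtime, and operations perspective. Domain details vary, but splitting failures along the flow input validation → core computation → side effects (I/O, network, concurrency) → observation makes root causes faster to trace.

Internal behavior and core mechanisms

flowchart TD
  A[Input, request, event] --> B[Parse, validate, decode]
  B --> C[Core computation, state transition]
  C --> D[Side effects: I/O, network, concurrency]
  D --> E[Result, observation, storage]

sequenceDiagram
  participant C as Client/caller
  participant B as Boundary (runtime, gateway, process)
  participant D as Dependency (API, DB, queue, file)
  C->>B: request/event
  B->>D: read, write, RPC
  D-->>B: latency, partial failure, retry possible
  B-->>C: response or error (code, correlation ID)

Invariants: writing down, stage by stage, conditions such as buffer bounds, protocol state, transaction isolation, and FD limits reduces debugging cost.
Determinism: separating pure layers from layers that depend on time, network, or scheduling makes testing and failure analysis easier.
Boundary costs: keep serialization, encoding, syscall counts, lock contention, allocation and GC, and cache misses on the list of suspects.
Backpressure: when a producer is faster than its consumer, decide where in the buffers, queues, or streams the slow-down signal lives.

Production operations patterns

Area	Operational question
Observability	Are per-request correlation IDs, error rate, p95/p99 latency, and dependency timeouts and retries visible on a dashboard?
Security	Are input validation, authorization, secrets, and audit logging consistent across code paths?
Reliability	Are retries limited to idempotent operations? Are circuit breakers, backoff, and a DLQ in place?
Performance	Do caches, batch sizes, connection pools, indexes, and backpressure fit the data volume?
Deployment	Are rollback runbooks, canary/blue-green rollouts, migrations, and feature flags documented?
Capacity	Are peak traffic, disk, FD, and thread pool limits verified regularly?

The closer staging matches production in data volume, network RTT, and concurrency, the more reliably problems reproduce.

Extended example: an end-to-end mini scenario

This checklist maps the topic of the main article onto a deployment and operations flow. Rename the steps to suit your domain.

Fix the input contract: put schema, version, maximum payload, timeouts, and error codes at the boundary.
Instrument the critical path: view request ID, per-stage latency, and external call result codes as one flow across logs, metrics, and traces.
Inject failures: reproduce dependency timeouts, 5xx responses, partial data, and lock waits in staging.
Compatibility and rollback: confirm that configuration, migrations, and client versions can be reverted.
Verify after load: check p95/p99 against peak, error rate, resource limits, and alert thresholds.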

handle(request):
  ctx = newCorrelationId()
  validated = validateSchema(request)
  authorize(validated, ctx)
  result = domainCore(validated)
  persistOrEmit(result, idempotentKey)
  recordMetrics(ctx, latency, outcome)
  return result

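Troubleshooting

Symptom	Possible causes	Action
Intermittent failures	Races, timeouts, external dependencies, DNS	Minimal reproduction script; correlate distributed traces and logs; check retry and circuit breaker settings
Performance degradation	N+1 queries, synchronous I/O, lock contention, excessive serialization, cache misses	Find hotspots with a profiler or APM, then remove them one at a time
Memory growth	Unbounded caches, subscription/listener leaks, large buffers, connections not returned	Add limits and TTLs; compare heap and FD snapshots
Fails only in build or deploy	Environment variables, permissions, platform differences, lockfile	Diff CI logs against local; pin runtime and image versions
Configuration mismatch	Profiles, secrets, defaults, region	Use a single schema-validated configuration source and a standard deployment matrix
Data inconsistency	Non-idempotent retries, partial writes, missed cache invalidation	Revisit idempotency keys, the outbox pattern, and transaction boundaries

Recommended order: (1) minimal reproduction, (2) narrow down recent changes, (3) compare environments and dependencies, (4) verify hypotheses with observations, (5) regression and load tests after the fix.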
