C++ Multithreaded Performance Tuning | Lock-Free, Work Stealing, Thread Pools [#51-3]

April 3, 2026 · 28 min read · Updated April 3, 2026 · Advanced hands-on

Key points of this post

Advanced C++ multithreading: lock-free data structures, work stealing, thread pool optimization, and false sharing prevention.

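Introduction: “I added threads and it got slower”

A real-world problem scenario

Situation: We parallelized the physics computation on a game server. Running 4 threads on 4 cores made it 2x faster. After upgrading the server to 8 cores and moving to 8 threads, throughput dropped below the 4-thread figure.

Cause: Lock contention, false sharing, and load imbalance in the thread pool combined. Adding cores does not automatically make things faster; the memory access patterns and the synchronization structure have to be tuned.

Solution: This post covers advanced multithreading tuning techniques, including lock-free data structures, work stealing, thread pool optimization, and false sharing prevention, with working code.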

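Additional problem scenarios

Scenario 2: Lock bottleneck in a log server

On a server handling 100,000 log entries per second, every worker takes work from a queue guarded by a single std::mutex. Raising the thread count to 16 reduced throughput by 30% compared with 8 threads. Lock contention was the bottleneck.

Scenario 3: Performance loss from cache line contention

Eight threads each increment their own counter, but the counters sit contiguously in an array. With a layout like int counter[8], 4 bytes apiece side by side, they all share one cache line (typically 64 bytes), so every write by one thread invalidates that line in the other threads' caches. This is false sharing.

Scenario 4: One busy worker, no work stealing

Only one task queue is used and task sizes vary widely. Ten large tasks arrive first and pile up on one worker while the remaining workers sit idle. The load is imbalanced.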

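flowchart TB
    subgraph problems["Problems"]
        P1[Lock contention] --> S1[Lower throughput]
        P2[False sharing] --> S2[Surge in cache invalidations]
        P3[Load imbalance] --> S3[Low core utilization]
    end
    subgraph solutions["Solutions"]
        H1[Lock-free] --> R1[Removes lock contention]
        H2[Cache line alignment] --> R2[Independent cache lines]
        H3[Work stealing] --> R3[Task redistribution]
    end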

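Goals:

Lock-free: synchronize with atomic operations instead of locks
Work stealing: idle workers take tasks from busy workers
Thread pool optimization: appropriate pool size and task distribution
False sharing prevention: cache line alignment

After reading this post you will be able to:

Understand how lock-free queues and stacks are implemented.
Implement a work-stealing thread pool yourself.
Diagnose and fix false sharing.
Apply the fixes for common errors and the performance tips.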

1. Problem scenarios and diagnosis

Before: a bottleneck caused by lock contention

// ❌ Bad: every worker waits on a single lock
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

class NaiveTaskQueue {
    std::queue<std::function<void()>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
public:
    void push(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(std::move(task));
        cv_.notify_one();
    }
    std::function<void()> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this]{ return !queue_.empty(); });
        auto task = std::move(queue_.front());
        queue_.pop();
        return task;
    }
};

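Problem: All 16 workers wait on the same mutex_ in pop(). The lock is acquired and released for every single task, and the time spent waiting for it grows as the thread count increases.

Finding the bottleneck with profiling

// Bottleneck diagnosis: measure the cost of mutex lock/unlock (uncontended baseline)
#include <chrono>
#include <iostream>
#include <mutex>

void profile_lock_contention() {
    std::mutex m;
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        std::lock_guard<std::mutex> lock(m);  // lock/unlock on every iteration
        // no work inside
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    std::cout << "1M lock/unlock pairs: " << ms << " ms\n";
}

Interpretation: Even on a single thread, one million lock/unlock pairs typically take on the order of tens of ms. With multiple threads, waiting time is added on top of that and the lock becomes the bottleneck.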
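To observe the contended case as well, the same loop can be run from several threads that share one mutex. The following is a minimal sketch; the names `ContentionResult` and `measure_contended_lock` and the thread and iteration counts are illustrative assumptions, and absolute timings depend on the hardware.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct ContentionResult {
    std::uint64_t total_increments;  // guarded increments actually performed
    long long elapsed_ms;            // wall-clock time for all threads
};

// Runs `num_threads` threads that each take the same mutex `iters` times.
ContentionResult measure_contended_lock(int num_threads, int iters) {
    std::mutex m;
    std::uint64_t shared_counter = 0;  // protected by m
    std::vector<std::thread> threads;

    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&m, &shared_counter, iters] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> lock(m);
                ++shared_counter;
            }
        });
    }
    for (auto& th : threads) th.join();
    auto end = std::chrono::steady_clock::now();

    // The mutex makes the increments race-free, so none may be lost.
    assert(shared_counter == static_cast<std::uint64_t>(num_threads) * iters);

    long long ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    return {shared_counter, ms};
}
```

Calling it with 1, 2, 4, and 8 threads while keeping the total iteration count fixed gives a rough picture of how much of the elapsed time is waiting rather than work.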

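What is false sharing?

A cache line (typically 64 bytes) is the unit in which CPU caches transfer and track data. If two different variables A and B sit on the same cache line, then when thread 1 modifies A the line is invalidated in the other cores' caches, and thread 2 takes a cache miss even when it only reads B and has to fetch the line again. A and B are logically independent, but because they physically share a line, false sharing occurs.

flowchart LR
    subgraph cacheline["Cache line (64B)"]
        A[Variable A] 
        B[Variable B]
    end
    T1[Thread 1: modifies A] --> A
    T2[Thread 2: reads B] --> B
    A -.->|cache invalidation| B

❌ Wrong: counters in a contiguous array

// ❌ Bad: 8 counters on the same cache line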
#include <atomic>
#include <thread>
#include <vector>

void bad_per_thread_counter() {
    std::atomic<int> counters[8] = {};  // 32 bytes: all fit in one cache line!
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&counters, t]() {
            for (int i = 0; i < 1000000; ++i) {
                counters[t]++;  // invalidates the line in the other threads' caches every time
            }
        });
    }
    for (auto& th : threads) th.join();
}

✅ Correct: cache line alignment

// ✅ Good: align to a cache line boundary
#include <atomic>
#include <cstddef>
#include <new>
#include <thread>
#include <vector>

// Cache line size (x86-64: 64 bytes)
constexpr size_t CACHE_LINE_SIZE = 64;

struct alignas(CACHE_LINE_SIZE) PaddedCounter {
    std::atomic<int> value{0};
    char padding[CACHE_LINE_SIZE - sizeof(std::atomic<int>)];
};

void good_per_thread_counter() {
    PaddedCounter counters[8];  // each on its own cache line
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) {
        threads.emplace_back([&counters, t]() {
            for (int i = 0; i < 1000000; ++i) {
                counters[t].value++;
            }
        });
    }
    for (auto& th : threads) th.join();
}

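Key point: alignas(CACHE_LINE_SIZE) places each counter on a different cache line. Since C++17 you can use std::hardware_destructive_interference_size from <new>, where the standard library provides it.

// C++17: portable cache line size (requires standard library support)
#include <atomic>
#include <new>

struct alignas(std::hardware_destructive_interference_size) PaddedCounter17 {
    std::atomic<int> value{0};  // alignas rounds sizeof up to the alignment
};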
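The layout claim can be checked at compile time and at run time. The sketch below assumes 64-byte cache lines (the constant `kCacheLine` and the names `PaddedSlot` and `slots_on_distinct_lines` are illustrative, not part of the original code).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte lines (typical x86-64)

struct alignas(kCacheLine) PaddedSlot {
    std::atomic<int> value{0};
};

// alignas rounds sizeof up to a multiple of the alignment,
// so consecutive array elements are exactly one cache line apart.
static_assert(alignof(PaddedSlot) == kCacheLine, "slot must be cache-line aligned");
static_assert(sizeof(PaddedSlot) == kCacheLine, "slot must fill one cache line");

// Returns true when every element of a local array starts on a cache line boundary.
bool slots_on_distinct_lines() {
    PaddedSlot slots[8];
    assert(sizeof(slots) == 8 * kCacheLine);
    for (int i = 0; i < 8; ++i) {
        auto addr = reinterpret_cast<std::uintptr_t>(&slots[i]);
        if (addr % kCacheLine != 0) return false;
    }
    return true;
}
```

If either static_assert fails on a given platform, the padding scheme needs a different line size rather than a silent fallback.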

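3. Lock-free data structures

Lock-free stack (with the ABA problem in mind)

Lock-free here means synchronizing without locks. It is implemented with std::atomic and CAS (compare-and-swap). A simple stack only needs to update a single head pointer atomically.

// Lock-free stack (simple version)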
#include <atomic>
#include <memory>

template<typename T>
class LockFreeStack {
    struct Node {
        T data;
        Node* next;
        Node(const T& d) : data(d), next(nullptr) {}
    };
    std::atomic<Node*> head_{nullptr};

public:
    void push(const T& value) {
        Node* new_node = new Node(value);
        new_node->next = head_.load(std::memory_order_relaxed);
        while (!head_.compare_exchange_weak(new_node->next, new_node,
                std::memory_order_release, std::memory_order_relaxed)) {
            // On CAS failure new_node->next is updated to the current head; retry
        }
    }

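    bool pop(T& value) {
        // acquire: old_head->next is dereferenced below
        Node* old_head = head_.load(std::memory_order_acquire);
        while (old_head && !head_.compare_exchange_weak(old_head, old_head->next,
                std::memory_order_acquire, std::memory_order_acquire)) {
            // On CAS failure old_head is updated to the current head
        }
        if (!old_head) return false;
        value = old_head->data;
        delete old_head;  // Unsafe with concurrent pops: use-after-free, and ABA if another thread does push-pop-push
        return true;
    }
};

Caution: The code above is vulnerable to the ABA problem, and because pop() deletes nodes immediately, another thread may still dereference a node that has already been freed. Production code uses hazard pointers or epoch-based reclamation. This version is for understanding the concept only.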
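One way to sidestep both problems without hazard pointers is a different technique from the ones named above: keep nodes in a preallocated array so they are never freed, and pack a version tag next to the head index so a stale CAS cannot succeed. The following is a sketch under those assumptions, not the article's method; `TaggedIndexStack`, `tagged_stack_demo`, and the capacity of 1024 are hypothetical, and the 32-bit tag can in principle wrap around.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Fixed-capacity Treiber-style stack: indices instead of pointers, plus a version tag.
template <typename T, std::uint32_t N>
class TaggedIndexStack {
    static constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // "null" index
    struct Node {
        T data{};
        std::atomic<std::uint32_t> next{kNil};
    };
    std::array<Node, N> nodes_;
    std::atomic<std::uint64_t> head_;  // live stack head: (tag << 32) | index
    std::atomic<std::uint64_t> free_;  // free-list head, same encoding

    static std::uint64_t pack(std::uint32_t index, std::uint32_t tag) {
        return (static_cast<std::uint64_t>(tag) << 32) | index;
    }
    static std::uint32_t index_of(std::uint64_t w) { return static_cast<std::uint32_t>(w); }
    static std::uint32_t tag_of(std::uint64_t w) { return static_cast<std::uint32_t>(w >> 32); }

    void push_index(std::atomic<std::uint64_t>& list, std::uint32_t idx) {
        std::uint64_t old = list.load(std::memory_order_relaxed);
        std::uint64_t desired;
        do {
            nodes_[idx].next.store(index_of(old), std::memory_order_relaxed);
            desired = pack(idx, tag_of(old) + 1);  // bump the tag on every update
        } while (!list.compare_exchange_weak(old, desired,
                     std::memory_order_release, std::memory_order_relaxed));
    }

    std::uint32_t pop_index(std::atomic<std::uint64_t>& list) {
        std::uint64_t old = list.load(std::memory_order_acquire);
        while (index_of(old) != kNil) {
            // A stale `next` is harmless: the tag check makes the CAS fail.
            std::uint32_t next = nodes_[index_of(old)].next.load(std::memory_order_relaxed);
            if (list.compare_exchange_weak(old, pack(next, tag_of(old) + 1),
                    std::memory_order_acquire, std::memory_order_acquire)) {
                return index_of(old);  // on success `old` still holds the popped head
            }
        }
        return kNil;
    }

public:
    TaggedIndexStack() : head_(pack(kNil, 0)), free_(pack(kNil, 0)) {
        for (std::uint32_t i = 0; i < N; ++i) push_index(free_, i);
    }
    bool push(const T& value) {
        std::uint32_t idx = pop_index(free_);
        if (idx == kNil) return false;  // capacity exhausted
        nodes_[idx].data = value;       // this thread owns the node until it is published
        push_index(head_, idx);
        return true;
    }
    bool pop(T& out) {
        std::uint32_t idx = pop_index(head_);
        if (idx == kNil) return false;  // stack empty
        out = nodes_[idx].data;
        push_index(free_, idx);         // recycle instead of delete
        return true;
    }
};

// Each thread pushes 1..per_thread, then all threads drain; returns the popped sum.
long long tagged_stack_demo(int num_threads, int per_thread) {
    TaggedIndexStack<int, 1024> stack;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&stack, per_thread] {
            for (int v = 1; v <= per_thread; ++v) {
                bool ok = stack.push(v);
                assert(ok);  // capacity must cover num_threads * per_thread
                (void)ok;
            }
        });
    for (auto& th : threads) th.join();
    threads.clear();

    std::atomic<long long> sum{0};
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([&stack, &sum] {
            int v;
            while (stack.pop(v)) sum.fetch_add(v, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
    return sum.load();
}
```

The trade-off is a fixed capacity and a default-constructible T; in exchange there is no deferred reclamation to manage.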

Lock-free ring queue (single producer, single consumer)

An SPSC (Single Producer Single Consumer) queue is easy to implement without locks.

// SPSC lock-free queue
#include <array>
#include <atomic>
#include <cstddef>

template<typename T, size_t N>
class SPSCQueue {
    std::array<T, N> buffer_;
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};

public:
    bool push(const T& value) {
        size_t current_tail = tail_.load(std::memory_order_relaxed);
        size_t next_tail = (current_tail + 1) % N;
        if (next_tail == head_.load(std::memory_order_acquire)) {
            return false;  // queue is full
        }
        buffer_[current_tail] = value;
        tail_.store(next_tail, std::memory_order_release);
        return true;
    }

    bool pop(T& value) {
        size_t current_head = head_.load(std::memory_order_relaxed);
        if (current_head == tail_.load(std::memory_order_acquire)) {
            return false;  // queue is empty
        }
        value = buffer_[current_head];
        head_.store((current_head + 1) % N, std::memory_order_release);
        return true;
    }
};

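Memory ordering: the acquire/release pair guarantees that the data written in push is visible in pop.

4. Thread pool optimization

Choosing the pool size

Thread count = core count is not always the best choice. Use more threads for I/O-bound work and match the core count for CPU-bound work.

// Pool size heuristic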
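The same guarantee can be seen in isolation with a single flag and a plain variable. This is a minimal sketch; the function name `release_acquire_handoff` and the value 42 are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the payload observed by the consumer thread.
int release_acquire_handoff() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // publication flag
    int observed = -1;

    std::thread consumer([&] {
        // acquire: once `true` is seen, writes made before the release store are visible
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        observed = payload;
    });
    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    producer.join();
    consumer.join();
    assert(observed == 42);
    return observed;
}
```

In the SPSC queue, tail_ plays the role of `ready` for the producer side and head_ does the same for the consumer side.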
#include <thread>

unsigned get_optimal_pool_size(bool io_bound = false) {
    unsigned hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 4;
    if (io_bound) {
        return hw * 2;  // while one thread waits on I/O, another can use the CPU
    }
    return hw;  // CPU-bound: same as the core count
}

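Reducing lock overhead with batching

Pushing and popping tasks one at a time takes the lock often. Moving several tasks per lock acquisition reduces the number of lock operations.

// Batch push/pop to reduce the number of lock acquisitions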
#include <vector>
#include <mutex>
#include <queue>

template<typename T>
class BatchQueue {
    std::queue<T> queue_;
    std::mutex mutex_;
public:
    void push_batch(std::vector<T>&& batch) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto& item : batch) {
            queue_.push(std::move(item));
        }
    }
    bool try_pop_batch(std::vector<T>& out, size_t max_size = 32) {
        std::lock_guard<std::mutex> lock(mutex_);
        while (!queue_.empty() && out.size() < max_size) {
            out.push_back(std::move(queue_.front()));
            queue_.pop();
        }
        return !out.empty();
    }
};

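5. Work stealing

Concept

Each worker has its own queue. When its queue is empty, it steals a task from the back of another worker's queue. The owner works from the front, so stealing from the back keeps contention low.

flowchart LR
    subgraph worker1["Worker 1"]
        Q1[Queue: t1 t2 t3 t4]
        W1[Process]
    end
    subgraph worker2["Worker 2"]
        Q2[Queue: empty]
        W2[Steal]
    end
    W2 -.->|steals t4| Q1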
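The two ends of the queue in the diagram can be modeled with a plain std::deque. This single-threaded sketch only illustrates which end each party uses; the function name `owner_and_thief_ends` is hypothetical and no locking is shown.

```cpp
#include <cassert>
#include <deque>
#include <utility>

// Returns {task taken by the owner, task taken by the thief}.
std::pair<int, int> owner_and_thief_ends() {
    std::deque<int> queue{1, 2, 3, 4};  // t1..t4 as in the diagram

    int owned = queue.front();  // the owner takes the oldest task from the front
    queue.pop_front();

    int stolen = queue.back();  // the thief takes from the opposite end
    queue.pop_back();

    assert(queue.size() == 2);  // t2 and t3 remain for the owner
    return {owned, stolen};
}
```

Because the two parties touch opposite ends, they only collide when the queue is nearly empty.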

Work-stealing thread pool implementation

// Work-stealing thread pool (core structure)
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class WorkStealingPool {
    using Task = std::function<void()>;
    std::vector<std::deque<Task>> queues_;  // one deque per worker
    std::vector<std::mutex> mutexes_;
    std::vector<std::condition_variable> cvs_;
    std::atomic<bool> done_{false};
    std::vector<std::thread> workers_;
    const size_t num_workers_;

    bool try_steal(size_t thief_id, Task& task) {
        for (size_t i = 1; i < num_workers_; ++i) {
            size_t victim = (thief_id + i) % num_workers_;
            std::lock_guard<std::mutex> lock(mutexes_[victim]);
            if (!queues_[victim].empty()) {
                task = std::move(queues_[victim].back());  // steal from the back
                queues_[victim].pop_back();
                return true;
            }
        }
        return false;
    }

    void worker_loop(size_t id) {
        while (!done_) {
            Task task;
            {
                std::unique_lock<std::mutex> lock(mutexes_[id]);
                cvs_[id].wait_for(lock, std::chrono::milliseconds(1),
                    [this, id] { return done_ || !queues_[id].empty(); });
                if (done_) break;
                if (!queues_[id].empty()) {
                    task = std::move(queues_[id].front());
                    queues_[id].pop_front();
                }
            }
            if (task) {
                task();
            } else if (try_steal(id, task)) {
                task();
            }
        }
    }

public:
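    // std::mutex and std::condition_variable are not movable, so the vectors
    // are sized at construction instead of with resize()
    explicit WorkStealingPool(size_t n)
        : queues_(n), mutexes_(n), cvs_(n), num_workers_(n) {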
        for (size_t i = 0; i < n; ++i) {
            workers_.emplace_back(&WorkStealingPool::worker_loop, this, i);
        }
    }

    void submit(size_t preferred_worker, Task task) {
        size_t id = preferred_worker % num_workers_;
        {
            std::lock_guard<std::mutex> lock(mutexes_[id]);
            queues_[id].push_back(std::move(task));
        }
        cvs_[id].notify_one();
    }

    ~WorkStealingPool() {
        done_ = true;
        for (auto& cv : cvs_) cv.notify_all();
        for (auto& w : workers_) w.join();
    }
};

6. Complete tuning examples

// Complete example: 8-thread parallel counters, cache line aligned
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

constexpr size_t CACHE_LINE = 64;

struct alignas(CACHE_LINE) AlignedCounter {
    std::atomic<uint64_t> value{0};
    char pad[CACHE_LINE - sizeof(std::atomic<uint64_t>)];
};

int main() {
    const int num_threads = 8;
    const int iterations = 10'000'000;
    std::vector<AlignedCounter> counters(num_threads);
    std::vector<std::thread> threads;

    auto start = std::chrono::high_resolution_clock::now();
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counters, t, iterations]() {
            for (int i = 0; i < iterations; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) th.join();
    auto end = std::chrono::high_resolution_clock::now();

    uint64_t total = 0;
    for (auto& c : counters) total += c.value.load();
    std::cout << "Total: " << total << ", expected: "
              << (uint64_t)num_threads * iterations << "\n";
    std::cout << "Time: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
              << " ms\n";
    return 0;
}

// Benchmark: padded vs. unpadded counters
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

void benchmark_counters() {
    const int N = 8;
    const int ITER = 5'000'000;

    // Unpadded
    std::atomic<int> bad[N] = {};
    auto t1 = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < N; ++i)
        threads.emplace_back([&bad, i]() {
            for (int j = 0; j < ITER; ++j) bad[i]++;
        });
    for (auto& t : threads) t.join();
    auto t2 = std::chrono::high_resolution_clock::now();

    // Padded (one cache line each)
    struct Padded { std::atomic<int> v{0}; char pad[60]; } good[N];
    auto t3 = std::chrono::high_resolution_clock::now();
    threads.clear();
    for (int i = 0; i < N; ++i)
        threads.emplace_back([&good, i]() {
            for (int j = 0; j < ITER; ++j) good[i].v++;
        });
    for (auto& t : threads) t.join();
    auto t4 = std::chrono::high_resolution_clock::now();

    std::cout << "Bad (false sharing): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2-t1).count() << " ms\n";
    std::cout << "Good (padded):       "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t4-t3).count() << " ms\n";
}

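Example 3: Log buffer with a lock-free SPSC queue

// Pass log messages asynchronously through an SPSC queue
#include <array>
#include <atomic>
#include <cstdio>
#include <iostream>
#include <string>
#include <thread>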

template<size_t N>
class LogBuffer {
    struct Entry {
        char msg[256];
        int level;
    };
    std::array<Entry, N> buffer_;
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};

public:
    bool push(const char* msg, int level) {
        size_t t = tail_.load(std::memory_order_relaxed);
        size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;
        snprintf(buffer_[t].msg, sizeof(buffer_[t].msg), "%s", msg);
        buffer_[t].level = level;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(char* out_msg, int& out_level) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        snprintf(out_msg, 256, "%s", buffer_[h].msg);
        out_level = buffer_[h].level;
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }
};

int main() {
    LogBuffer<1024> log_buf;
    std::thread producer([&]() {
        for (int i = 0; i < 100; ++i) {
            char buf[64];
            snprintf(buf, sizeof(buf), "log %d", i);
            while (!log_buf.push(buf, 0)) std::this_thread::yield();
        }
    });
    std::thread consumer([&]() {
        char msg[256];
        int level;
        int count = 0;
        while (count < 100) {
            if (log_buf.pop(msg, level)) {
                std::cout << msg << "\n";
                ++count;
            }
        }
    });
    producer.join();
    consumer.join();
    return 0;
}

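7. Common errors and fixes

Error 1: “I added threads and performance got worse”

Cause: Lock contention or false sharing.

Fix:

Check lock wait time with a profiler (perf, VTune)
Align per-thread data with alignas(64) or std::hardware_destructive_interference_size
Consider a lock-free structure instead of a lock

Error 2: “Data is lost in a lock-free queue”

Cause: Wrong memory ordering. With memory_order_relaxed alone the order of writes is not guaranteed, so the consumer can read a slot whose data has not been written yet.

// ❌ Wrong
tail_.store(next, std::memory_order_relaxed);  // the index update may become visible before the data write

// ✅ Correct
tail_.store(next, std::memory_order_release);  // earlier writes become visible before this store

Error 3: “Deadlock in work stealing”

Cause: When locks on several queues are held at the same time, an inconsistent acquisition order can deadlock.

Fix: When stealing, always take the locks in the same order (for example, ascending victim ID). The example above holds only one victim lock at a time, so it cannot deadlock.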
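When a design really does need two queue locks at once, C++17 std::scoped_lock acquires them with a deadlock-avoidance algorithm, which removes the need to order them by hand. The sketch below is an illustration of that option; `LockedQueue`, `move_tasks`, and `opposite_direction_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>

struct LockedQueue {
    std::mutex mutex;
    std::deque<int> items;
};

// Moves up to `count` items from the back of `from` to the back of `to`.
void move_tasks(LockedQueue& from, LockedQueue& to, std::size_t count) {
    if (&from == &to) return;  // passing the same mutex twice would be an error
    std::scoped_lock lock(from.mutex, to.mutex);  // locks both without deadlock
    while (count-- > 0 && !from.items.empty()) {
        to.items.push_back(from.items.back());
        from.items.pop_back();
    }
}

// Two threads move tasks in opposite directions; returns the total item count afterwards.
std::size_t opposite_direction_demo() {
    LockedQueue a, b;
    a.items = {1, 2, 3, 4};
    b.items = {5, 6, 7, 8};

    std::thread t1([&] { for (int i = 0; i < 1000; ++i) move_tasks(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < 1000; ++i) move_tasks(b, a, 1); });
    t1.join();
    t2.join();

    std::size_t total = a.items.size() + b.items.size();
    assert(total == 8);  // items are only moved, never lost or duplicated
    return total;
}
```

With two separate std::lock_guard objects taken in argument order, the same demo could deadlock because t1 and t2 lock the mutexes in opposite orders.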

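Error 4: “T in std::atomic<T> is not trivially copyable”

Cause: The primary std::atomic template only supports trivially copyable types. std::atomic<std::string> is not allowed.

Fix: Make a pointer atomic, or handle a std::shared_ptr with std::atomic_load/std::atomic_store (deprecated in C++20 in favor of the std::atomic<std::shared_ptr<T>> specialization).

// ❌ Wrong
std::atomic<std::string> s;  // compile error

// ✅ Correct (requires C++20: std::atomic<std::shared_ptr<T>> specialization)
std::atomic<std::shared_ptr<std::string>> ptr;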
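Whether a type qualifies can be checked with std::is_trivially_copyable_v before wrapping it in std::atomic. A small sketch follows; the `Point` struct and the function name `atomic_point_roundtrip` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <type_traits>

struct Point {  // hypothetical small value type
    int x;
    int y;
};

static_assert(std::is_trivially_copyable_v<Point>, "Point can be used with std::atomic");
static_assert(!std::is_trivially_copyable_v<std::string>,
              "std::string cannot be used with std::atomic");

// Stores and reloads a whole Point as one atomic value.
Point atomic_point_roundtrip() {
    std::atomic<Point> p{Point{1, 2}};
    p.store(Point{3, 4}, std::memory_order_release);
    Point q = p.load(std::memory_order_acquire);
    assert(q.x == 3);
    return q;
}
```

Whether such an atomic is actually lock-free depends on the size of the type and the platform; `is_lock_free()` reports it at run time.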

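Error 5: “Tasks are lost when the thread pool shuts down”

Cause: If you only set done_ = true and do not call notify_all, workers blocked in wait never wake up, so join() hangs and the queued tasks never run.

Fix: Always call notify_all on shutdown.

~WorkStealingPool() {
    done_ = true;
    for (auto& cv : cvs_) cv.notify_all();  // required!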
    for (auto& w : workers_) w.join();
}

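Error 6: “Endless retries in a compare_exchange_weak loop”

Cause: compare_exchange_weak may fail spuriously. A bare while (!cas(...)) loop normally terminates, but under heavy contention it can keep retrying for a long time and burn CPU.

Fix: After a number of failures, yield, or switch to compare_exchange_strong.

// ✅ Improved: back off after repeated failures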
int retries = 0;
while (!head_.compare_exchange_weak(old_head, new_head,
        std::memory_order_acquire, std::memory_order_relaxed)) {
    if (++retries > 100) {
        std::this_thread::yield();
        retries = 0;
    }
}

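Error 7: “Lost wakeup on a condition variable”

Cause: If notify_one is called before the other thread has entered wait, that thread misses the notification.

Fix: Combine the condition check and the wait in a predicate lambda, so the condition is checked before blocking and again after every wakeup. Pass the predicate as the second argument to wait.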

cv_.wait(lock, [this] { return !queue_.empty() || done_; });
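The effect of the predicate can be demonstrated by deliberately notifying before anyone waits. In this sketch (the function name `notify_before_wait_demo` and the value 7 are illustrative) the notification itself is missed, yet the consumer does not block because the predicate is already true.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// The producer finishes and notifies before the consumer starts waiting.
// Returns the value the consumer received.
int notify_before_wait_demo() {
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> q;

    std::thread producer([&] {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push(7);
        }
        cv.notify_one();  // nobody is waiting yet, so this notification has no effect
    });
    producer.join();

    int received = -1;
    std::thread consumer([&] {
        std::unique_lock<std::mutex> lock(m);
        // The predicate is evaluated first; it is already true, so wait returns at once.
        cv.wait(lock, [&] { return !q.empty(); });
        received = q.front();
        q.pop();
    });
    consumer.join();

    assert(received == 7);
    return received;
}
```

A wait without a predicate in the same position would block forever, since no further notification arrives.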

8. Performance optimization tips

Tip 1: Minimize the critical section

Do as little as possible while holding a lock.

// ❌ Bad: heavy work inside the lock
{
    std::lock_guard<std::mutex> lock(m);
    auto data = expensive_computation();  // other threads wait for the whole computation
    queue_.push(data);
}

// ✅ Good: compute outside the lock
auto data = expensive_computation();
{
    std::lock_guard<std::mutex> lock(m);
    queue_.push(std::move(data));
}

Tip 2: Reader-writer lock for read-mostly data

When reads vastly outnumber writes, use std::shared_mutex (C++17) so readers can share the lock.

#include <mutex>
#include <shared_mutex>
#include <vector>

std::shared_mutex rw_mutex;
std::vector<int> data;

void reader() {
    std::shared_lock lock(rw_mutex);  // many threads can read at the same time
    use(data);
}

void writer() {
    std::unique_lock lock(rw_mutex);  // exclusive while writing
    data.push_back(42);
}

Tip 3: Spinlocks only for short critical sections

Consider a std::atomic_flag spinlock only when the wait is very short. For longer waits std::mutex may be the better choice.

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin only for short waits
        }
    }
    void unlock() {
        flag_.clear(std::memory_order_release);
    }
};

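Tip 4: Minimize memory allocation

Calling new/delete on every operation of a lock-free queue makes the allocator the bottleneck. Use a memory pool or reuse preallocated nodes.

// Memory pool example: node reuse
#include <atomic>

template<typename T>
class LockFreeStackWithPool {
    struct Node {
        T data;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};
    std::atomic<Node*> pool_{nullptr};  // free list of returned nodes

    Node* alloc_node(const T& value) {
        // Detach the whole free list at once (avoids ABA on the pool itself)
        Node* n = pool_.exchange(nullptr);
        if (!n) return new Node{value, nullptr};
        if (Node* rest = n->next) {  // hand the unused remainder back instead of leaking it
            Node* tail = rest;
            while (tail->next) tail = tail->next;
            tail->next = pool_.load();
            while (!pool_.compare_exchange_weak(tail->next, rest)) {}
        }
        n->data = value;
        n->next = nullptr;
        return n;
    }
    void free_node(Node* n) {
        n->next = pool_.load();
        while (!pool_.compare_exchange_weak(n->next, n)) {}
    }
    // push/pop use alloc_node and free_node
};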

Tip 5: NUMA awareness (server-class systems)

On NUMA (Non-Uniform Memory Access) machines, pinning threads to a specific node reduces memory access latency. Set the CPU affinity with numactl or pthread_setaffinity_np.

#ifdef __linux__
#include <pthread.h>
void pin_to_cpu(int cpu_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}
#endif

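Performance comparison summary

Technique	When to apply	Expected effect (rough, workload-dependent)
Remove false sharing	per-thread counters/buffers	2-10x (grows with thread count)
Lock-free queue	high-frequency enqueue/dequeue	removes lock contention, 1.5-3x
Work stealing	uneven task sizes	load balancing, 1.2-2x
Batching	many small tasks	fewer lock acquisitions, 1.3-2x
RwLock	reads >> writes	parallel reads, 2-4x for reads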

9. Production patterns

Pattern 1: Graceful shutdown

Finish the items in progress, then stop the threads.

void shutdown() {
    stop_requested_ = true;
    cv_.notify_all();
    for (auto& w : workers_) {
        if (w.joinable()) w.join();
    }
}
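The fragment above only shows the stop request. A fuller sketch of a pool that drains its queue before exiting could look like the following; `DrainingPool` and `drain_demo` are hypothetical names, and the stop flag is a plain bool guarded by the mutex rather than an atomic.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class DrainingPool {
    std::queue<std::function<void()>> queue_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool stop_requested_ = false;  // protected by mutex_
    std::vector<std::thread> workers_;

    void worker_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stop_requested_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested and nothing left to drain
                task = std::move(queue_.front());
                queue_.pop();
            }
            task();  // run outside the lock
        }
    }

public:
    explicit DrainingPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { worker_loop(); });
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(task));
        }
        cv_.notify_one();
    }
    void shutdown() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_requested_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) {
            if (w.joinable()) w.join();
        }
    }
    ~DrainingPool() { shutdown(); }
};

// Submits `num_tasks` counters and shuts down immediately; returns how many ran.
int drain_demo(int num_tasks) {
    std::atomic<int> done{0};
    DrainingPool pool(2);
    for (int i = 0; i < num_tasks; ++i) pool.submit([&done] { ++done; });
    pool.shutdown();  // returns only after the queue has been drained
    assert(done.load() == num_tasks);
    return done.load();
}
```

Workers exit only when the stop flag is set and the queue is empty, so tasks submitted before shutdown() are not dropped.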

Pattern 2: Backpressure

When the queue is full, block the producer to keep memory from growing without bound.

void push_with_backpressure(Task task) {
    std::unique_lock lock(mutex_);
    cv_full_.wait(lock, [this] { return queue_.size() < max_size_; });
    queue_.push(std::move(task));
    cv_empty_.notify_one();
}
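For the producer to wake up again, the consumer side has to signal cv_full_ after removing an item. A self-contained sketch of both halves follows; `BoundedQueue`, `high_water`, and `backpressure_demo` are hypothetical names added for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

template <typename T>
class BoundedQueue {
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable cv_full_;   // signaled when space becomes available
    std::condition_variable cv_empty_;  // signaled when an item becomes available
    const std::size_t max_size_;
    std::size_t high_water_ = 0;        // largest size ever observed

public:
    explicit BoundedQueue(std::size_t max_size) : max_size_(max_size) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_full_.wait(lock, [this] { return queue_.size() < max_size_; });
        queue_.push(std::move(item));
        if (queue_.size() > high_water_) high_water_ = queue_.size();
        lock.unlock();
        cv_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_empty_.wait(lock, [this] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop();
        lock.unlock();
        cv_full_.notify_one();  // wake a producer blocked on a full queue
        return item;
    }

    std::size_t high_water() {
        std::lock_guard<std::mutex> lock(mutex_);
        return high_water_;
    }
};

// Pushes 1..n through a queue of capacity 4; returns {sum consumed, high-water mark}.
std::pair<long long, std::size_t> backpressure_demo(int n) {
    BoundedQueue<int> q(4);
    long long sum = 0;
    std::thread producer([&] { for (int i = 1; i <= n; ++i) q.push(i); });
    std::thread consumer([&] { for (int i = 0; i < n; ++i) sum += q.pop(); });
    producer.join();
    consumer.join();
    assert(q.high_water() <= 4);  // the bound was never exceeded
    return {sum, q.high_water()};
}
```

The high-water mark is only there to make the bound observable; a production queue would more likely export the current size as a metric.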

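Pattern 3: Priority queue

To handle urgent tasks first, use a priority queue.

using PrioTask = std::pair<int, std::function<void()>>;
// std::function has no operator<, so compare the priority field only
struct ByPriority { bool operator()(const PrioTask& a, const PrioTask& b) const { return a.first > b.first; } };
std::priority_queue<PrioTask, std::vector<PrioTask>, ByPriority> queue_;
// Smaller priority numbers run first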

Pattern 4: Health checks and monitoring

struct PoolStats {
    size_t queue_size;
    size_t active_tasks;
    size_t completed_tasks;
};

PoolStats get_stats() const {
    std::lock_guard lock(mutex_);  // mutex_ must be declared mutable to lock it in a const member
    return {queue_.size(), active_, completed_};
}

Pattern 5: Task timeout

Put a timeout on tasks so that a long-running task does not tie up a worker.

template<typename F>
bool run_with_timeout(F&& f, std::chrono::milliseconds timeout) {
    std::packaged_task<bool()> task([f = std::forward<F>(f)]() mutable {
        f();
        return true;
    });
    auto future = task.get_future();
    std::thread t(std::move(task));
    if (future.wait_for(timeout) == std::future_status::timeout) {
        // Timed out: detach the thread or set a cancellation flag (a detached thread keeps running)
        t.detach();
        return false;
    }
    t.join();
    return future.get();
}

Pattern 6: Separate CPU-bound and I/O-bound pools

Splitting CPU-intensive tasks and I/O-waiting tasks into separate pools keeps CPU workers from being blocked during I/O waits.

// CPU pool: core count
WorkStealingPool cpu_pool{std::thread::hardware_concurrency()};

// I/O pool: 2-4x the core count (makes use of the wait time)
WorkStealingPool io_pool{std::thread::hardware_concurrency() * 2};

void handle_request(Request req) {
    io_pool.submit(0, [req]() {
        auto data = fetch_from_db(req);  // I/O wait
        cpu_pool.submit(0, [data]() {
            process(data);  // CPU work
        });
    });
}

Pattern 7: Use thread-local storage

Give each thread its own cache or buffer so it can be reused without locks.

thread_local std::vector<int> thread_local_buffer;

void process_item(int x) {
    thread_local_buffer.clear();
    thread_local_buffer.push_back(x);
    // only this thread touches the buffer, so no lock is needed
    do_work(thread_local_buffer);
}

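10. Summary and checklist

Summary table

Item	Description
False sharing	Separate per-thread data with cache line alignment (`alignas(64)`)
Lock-free	`std::atomic` + CAS; mind the memory ordering
Work stealing	One queue per worker; steal from another queue when empty
Thread pool	Pool size = core count (CPU-bound) or 2x (I/O-bound); batch to reduce locking

Core principles

Measure the bottleneck first: check lock waits and cache misses with a profiler
Remove false sharing: align per-thread data to cache lines
Minimize critical sections: compute outside the lock
Choose the right structure: consider plain lock → RwLock → lock-free, in that order

Implementation checklist

Apply alignas(CACHE_LINE_SIZE) to per-thread counters/buffers
Use memory_order_release/acquire in lock-free structures
Keep lock ordering consistent when work stealing (prevents deadlock)
Call notify_all when shutting down the thread pool
Implement backpressure by bounding the queue size
Monitor pool size and queue size in production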

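Frequently asked questions (FAQ)

Q. When is this used in practice?

A. In systems where multicore utilization matters: high-performance servers, real-time systems, parallel algorithms, and game engines. Apply it using the examples and selection guidance in the body above.

Q. What should I read first?

A. Follow the previous-post link at the bottom of each post to read in order. The C++ series table of contents shows the overall flow.

Q. How can I study this in more depth?

A. See cppreference and the official documentation of the libraries involved. The reference links at the end of the post are also useful.

One-line summary: Mastering lock-free techniques, work stealing, thread pool optimization, and false sharing prevention lets you get the most out of multicore hardware.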

C++ Thread Pool Complete Guide | Task Queues, Parallel Processing, Performance Benchmarks [#51-3]
C++ Memory Ordering Complete Guide | relaxed, acquire/release