C++ Parallel Algorithms | Parallel Algorithm Guide

C++ Parallel Algorithms | Parallel Algorithm Guide

이 글의 핵심

A practical guide to C++ Parallel Algorithms.

Entering

C++17 Parallel Algorithms take advantage of multicore CPUs to *automatically parallelize standard algorithms. You can execute std::sort, std::transform, etc. in parallel simply by adding std::execution::par.


1. Execution Policy

Execution policy type```cpp

#include #include #include

int main() { std::vector v = {3, 1, 4, 1, 5, 9, 2, 6};

// 1. seq: sequential execution (default) std::sort(std::execution::seq, v.begin(), v.end());

// 2. par: Parallel execution std::sort(std::execution::par, v.begin(), v.end());

// 3. par_unseq: parallel + vectorization std::sort(std::execution::par_unseq, v.begin(), v.end());

return 0;

}


| policy | Description | Features | When to use |
|------|-----|------|---------|
| seq | sequential execution | single thread | small data |
| par | parallel execution | multithreaded | Big data, independent computation |
| par_unseq | parallel + vectorized | SIMD + multithreaded | Big data, simple operations |

---

## 2. Parallel sorting

### Default sorting```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>
#include <chrono>

void benchmarkSort() {
    std::vector<int> v1(10000000);
    std::generate(v1.begin(), v1.end(), std::rand);
    
    auto v2 = v1;
    
    // Sequential Sorting
    auto start = std::chrono::high_resolution_clock::now();
    std::sort(v1.begin(), v1.end());
    auto end = std::chrono::high_resolution_clock::now();
    auto seq_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    
    // parallel sort
    start = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, v2.begin(), v2.end());
    end = std::chrono::high_resolution_clock::now();
    auto par_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    
std::cout << "Sort sequence: " << seq_time.count() << "ms" << std::endl;
std::cout << "Parallel sort: " << par_time.count() << "ms" << std::endl;
std::cout << "Speedup: " << (double)seq_time.count() / par_time.count() << "x" << std::endl;
}

int main() {
    benchmarkSort();
    return 0;
}
```Output (4 cores):```
Sequential Sorting: 2341 ms
Parallel alignment: 687 ms
Speedup: 3.4x
```---

## 3. Parallel conversion

### transform```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <cmath>
#include <iostream>

int main() {
    std::vector<double> data(10000000);
    std::iota(data.begin(), data.end(), 1.0);
    
    // Parallel square root calculation
    std::transform(std::execution::par, 
        data.begin(), data.end(), data.begin(),
         { return std::sqrt(x); });
    
std::cout << "First 5: ";
    for (int i = 0; i < 5; ++i) {
        std::cout << data[i] << " ";
    }
    std::cout << std::endl;
    
    return 0;
}
```output of power:```
First 5: 1 1.41421 1.73205 2 2.23607

for_each

#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> v(1000000);
    std::iota(v.begin(), v.end(), 1);
    
// parallel for_each
    std::for_each(std::execution::par, v.begin(), v.end(), 
         {
x = x * x;  // square
        });
    
    std::cout << "v[99]: " << v[99] << std::endl;  // 10000
    
    return 0;
}
```---

## 4. Parallel aggregation

### reduce```cpp
#include <numeric>
#include <execution>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> v(10000000);
    std::iota(v.begin(), v.end(), 1);
    
    // Parallel reduce(sum)
    long long sum = std::reduce(std::execution::par, 
        v.begin(), v.end(), 0LL);
    
std::cout << "sum: " << sum << std::endl;
    
    return 0;
}
```**output of power:**```
Total: 50000005000000

transform_reduce

#include <numeric>
#include <execution>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> v(1000000);
    std::iota(v.begin(), v.end(), 1);
    
// parallel sum of squares
    long long sum_of_squares = std::transform_reduce(
        std::execution::par,
        v.begin(), v.end(),
        0LL,
        std::plus<>(),
         { return static_cast<long long>(x) * x; }
    );
    
std::cout << "sum of squares: " << sum_of_squares << std::endl;
    
    return 0;
}
```---

## 5. Parallel search

### find```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> v(10000000);
    std::iota(v.begin(), v.end(), 1);
    
    // parallel find
    auto it = std::find(std::execution::par, v.begin(), v.end(), 5000000);
    
    if (it != v.end()) {
std::cout << "Find: " << *it << std::endl;
std::cout << "Index: " << std::distance(v.begin(), it) << std::endl;
    }
    
    return 0;
}

count_if

#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> v(10000000);
    std::generate(v.begin(), v.end(), std::rand);
    
    // parallel count_if(even count)
    auto count = std::count_if(std::execution::par, v.begin(), v.end(),
         { return x % 2 == 0; });
    
std::cout << "Even number: " << count << std::endl;
    
    return 0;
}
```---

## 6. Frequently occurring problems

### Problem 1: Small data```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <chrono>
#include <iostream>

void benchmarkSmallData() {
    std::vector<int> small(100);
    std::generate(small.begin(), small.end(), std::rand);
    
    auto v1 = small;
    auto v2 = small;
    
// sequential
    auto start = std::chrono::high_resolution_clock::now();
    std::sort(v1.begin(), v1.end());
    auto end = std::chrono::high_resolution_clock::now();
    auto seq_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    
// parallel (overhead > gain)
    start = std::chrono::high_resolution_clock::now();
    std::sort(std::execution::par, v2.begin(), v2.end());
    end = std::chrono::high_resolution_clock::now();
    auto par_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
    
std::cout << "Sequence: " << seq_time.count() << "μs" << std::endl;
std::cout << "Parallel: " << par_time.count() << "μs" << std::endl;
}

int main() {
    benchmarkSmallData();
// Sequence: 5μs
// Parallel: 150μs (overhead)
    
    return 0;
}
```### Problem 2: Shared State (Data Race)```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> v(1000000);
    std::iota(v.begin(), v.end(), 1);
    
    int sum = 0;
    
    // ❌ Data Race
    std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
        sum += x;  // Multiple threads access sum simultaneously
    });
    
std::cout << "Invalid sum: " << sum << std::endl;  //Not as expected
    
    // ✅ Use reduce
    long long correct_sum = std::reduce(std::execution::par, v.begin(), v.end(), 0LL);
    
std::cout << "Correct sum: " << correct_sum << std::endl;
    
    return 0;
}
```### Issue 3: Exception handling```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> v = {1, 2, -3, 4, -5};
    
    try {
// Exception during parallel execution
        std::for_each(std::execution::par, v.begin(), v.end(),  {
            if (x < 0) {
throw std::runtime_error("Negative number found");
            }
        });
    } catch (const std::exception& e) {
// Exception can occur in multiple threads
// Only the first exception is caught
std::cout << "Exception: " << e.what() << std::endl;
    }
    
    return 0;
}
```### Problem 4: Order-dependent operations```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>

int main() {
    std::vector<int> v = {1, 2, 3, 4, 5};
    
    // ❌ Order dependent (lace)
    std::for_each(std::execution::par, v.begin(), v.end(), [&](int& x) {
        x = v[0] + 1;  // v[0] read by multiple threads (dangerous)
    });
    
    // ✅ Independent operations
    std::for_each(std::execution::par, v.begin(), v.end(),  {
        x = x * 2;  // Each element is independent
    });
    
    for (int x : v) std::cout << x << " ";
    std::cout << std::endl;
    
    return 0;
}
```---

## 7. Supported Algorithms

### Main Algorithm```cpp
#include <algorithm>
#include <numeric>
#include <execution>

// sort
std::sort(std::execution::par, v.begin(), v.end());
std::stable_sort(std::execution::par, v.begin(), v.end());
std::partial_sort(std::execution::par, v.begin(), v.begin() + 10, v.end());

// search
std::find(std::execution::par, v.begin(), v.end(), 42);
std::find_if(std::execution::par, v.begin(), v.end(), pred);
std::binary_search(std::execution::par, v.begin(), v.end(), 42);

// conversion
std::transform(std::execution::par, v.begin(), v.end(), v.begin(), func);
std::for_each(std::execution::par, v.begin(), v.end(), func);

// tally
std::reduce(std::execution::par, v.begin(), v.end(), 0);
std::transform_reduce(std::execution::par, v.begin(), v.end(), 0, plus, func);

// copy
std::copy(std::execution::par, v1.begin(), v1.end(), v2.begin());
std::copy_if(std::execution::par, v1.begin(), v1.end(), v2.begin(), pred);

// etc
std::count(std::execution::par, v.begin(), v.end(), 42);
std::count_if(std::execution::par, v.begin(), v.end(), pred);
std::all_of(std::execution::par, v.begin(), v.end(), pred);
std::any_of(std::execution::par, v.begin(), v.end(), pred);
```---

## 8. Practical example: image processing```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <cmath>
#include <iostream>

struct Pixel {
    unsigned char r, g, b;
};

class ImageProcessor {
    std::vector<Pixel> pixels;
    int width, height;
    
public:
    ImageProcessor(int w, int h) : width(w), height(h) {
        pixels.resize(w * h);
    }
    
    // Parallel grayscale conversion
    void toGrayscale() {
        std::for_each(std::execution::par, pixels.begin(), pixels.end(),
             {
                unsigned char gray = static_cast<unsigned char>(
                    0.299 * p.r + 0.587 * p.g + 0.114 * p.b
                );
                p.r = p.g = p.b = gray;
            });
    }
    
    // Parallel brightness adjustment
    void adjustBrightness(int delta) {
        std::for_each(std::execution::par, pixels.begin(), pixels.end(),
            [delta](Pixel& p) {
                auto clamp =  {
                    return std::max(0, std::min(255, val));
                };
                
                p.r = clamp(p.r + delta);
                p.g = clamp(p.g + delta);
                p.b = clamp(p.b + delta);
            });
    }
    
    // Parallel Blur (Simple Average)
    void blur() {
        auto original = pixels;
        
        std::for_each(std::execution::par, pixels.begin(), pixels.end(),
            [&](Pixel& p) {
                int idx = &p - &pixels[0];
                int x = idx % width;
                int y = idx / width;
                
                // 3x3 average (simplified)
                int r_sum = 0, g_sum = 0, b_sum = 0, count = 0;
                
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        int nx = x + dx;
                        int ny = y + dy;
                        
                        if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                            int nidx = ny * width + nx;
                            r_sum += original[nidx].r;
                            g_sum += original[nidx].g;
                            b_sum += original[nidx].b;
                            ++count;
                        }
                    }
                }
                
                p.r = r_sum / count;
                p.g = g_sum / count;
                p.b = b_sum / count;
            });
    }
    
    // parallel pixel count
    int countBrightPixels(int threshold) {
        return std::count_if(std::execution::par, pixels.begin(), pixels.end(),
            [threshold](const Pixel& p) {
                int brightness = (p.r + p.g + p.b) / 3;
                return brightness > threshold;
            });
    }
};

int main() {
    ImageProcessor img(1920, 1080);
    
    img.toGrayscale();
    img.adjustBrightness(20);
    img.blur();
    
    int bright = img.countBrightPixels(128);
std::cout << "Bright pixel: " << bright << std::endl;
    
    return 0;
}
```---

## Cleanup

### Key takeaways

1. **Parallel algorithm**: C++17 multi-core utilization
2. **Execution Policy**: seq, par, par_unseq
3. **Performance**: 2-4x improvement on large data
4. **Caution**: No data race or order dependency.
5. **Application**: More than 10,000, independent calculations

### Parallel algorithm effects

| data size | computational complexity | parallel effect | Recommended Policy |
|-----------|-----------|---------|---------|
| < 1,000 | low | None | seq |
| 1,000-10,000 | low | low | seq |
| > 10,000 | low | middle | par |
| > 10,000 | High | High | par |
| > 100,000 | Simple | very high | par_unseq |

### Practical tips

**Principle of use:**
- Only used for large data (more than 10,000 items)
- Effective for computationally intensive operations
- Parallelize only independent operations
- Avoid shared states

**Performance:**
- Measure effectiveness through profiling
- Small data has large overhead
- Memory-intensive operations are less effective
- Effects vary depending on the number of CPU cores

**Caution:**
- Data race prevention (`std::atomic`, `reduce`)
- Careful handling of exceptions (possible `std::terminate`)
- Prohibit order-dependent operations
- Difficult to debug (using TSan)

### Next steps

- [C++ Execution Policy](/blog/cpp-execution-policy/)
- C++ Thread
- C++ Atomic

---

## Related articles

- [C++ Algorithm Sort | ](/blog/cpp-algorithm-sort/)
- [Arrays and lists | Complete summary of essential data structures for coding tests](/blog/algorithm-series-01-array-list/)