C++ Parallel Algorithms | Parallel Algorithm Guide
이 글의 핵심
A practical guide to C++ Parallel Algorithms.
Entering
C++17 Parallel Algorithms take advantage of multicore CPUs to *automatically parallelize standard algorithms. You can execute std::sort, std::transform, etc. in parallel simply by adding std::execution::par.
1. Execution Policy
Execution policy type```cpp
#include
int main() {
std::vector
// 1. seq: sequential execution (default) std::sort(std::execution::seq, v.begin(), v.end());
// 2. par: Parallel execution std::sort(std::execution::par, v.begin(), v.end());
// 3. par_unseq: parallel + vectorization std::sort(std::execution::par_unseq, v.begin(), v.end());
return 0;
}
| policy | Description | Features | When to use |
|------|-----|------|---------|
| seq | sequential execution | single thread | small data |
| par | parallel execution | multithreaded | Big data, independent computation |
| par_unseq | parallel + vectorized | SIMD + multithreaded | Big data, simple operations |
---
## 2. Parallel sorting
### Default sorting```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>
#include <chrono>
void benchmarkSort() {
std::vector<int> v1(10000000);
std::generate(v1.begin(), v1.end(), std::rand);
auto v2 = v1;
// Sequential Sorting
auto start = std::chrono::high_resolution_clock::now();
std::sort(v1.begin(), v1.end());
auto end = std::chrono::high_resolution_clock::now();
auto seq_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
// parallel sort
start = std::chrono::high_resolution_clock::now();
std::sort(std::execution::par, v2.begin(), v2.end());
end = std::chrono::high_resolution_clock::now();
auto par_time = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
std::cout << "Sort sequence: " << seq_time.count() << "ms" << std::endl;
std::cout << "Parallel sort: " << par_time.count() << "ms" << std::endl;
std::cout << "Speedup: " << (double)seq_time.count() / par_time.count() << "x" << std::endl;
}
int main() {
benchmarkSort();
return 0;
}
```Output (4 cores):```
Sequential Sorting: 2341 ms
Parallel alignment: 687 ms
Speedup: 3.4x
```---
## 3. Parallel conversion
### transform```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <cmath>
#include <iostream>
int main() {
std::vector<double> data(10000000);
std::iota(data.begin(), data.end(), 1.0);
// Parallel square root calculation
std::transform(std::execution::par,
data.begin(), data.end(), data.begin(),
{ return std::sqrt(x); });
std::cout << "First 5: ";
for (int i = 0; i < 5; ++i) {
std::cout << data[i] << " ";
}
std::cout << std::endl;
return 0;
}
```output of power:```
First 5: 1 1.41421 1.73205 2 2.23607
for_each
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>
int main() {
std::vector<int> v(1000000);
std::iota(v.begin(), v.end(), 1);
// parallel for_each
std::for_each(std::execution::par, v.begin(), v.end(),
{
x = x * x; // square
});
std::cout << "v[99]: " << v[99] << std::endl; // 10000
return 0;
}
```---
## 4. Parallel aggregation
### reduce```cpp
#include <numeric>
#include <execution>
#include <vector>
#include <iostream>
int main() {
std::vector<int> v(10000000);
std::iota(v.begin(), v.end(), 1);
// Parallel reduce(sum)
long long sum = std::reduce(std::execution::par,
v.begin(), v.end(), 0LL);
std::cout << "sum: " << sum << std::endl;
return 0;
}
```**output of power:**```
Total: 50000005000000
transform_reduce
#include <numeric>
#include <execution>
#include <vector>
#include <iostream>
int main() {
std::vector<int> v(1000000);
std::iota(v.begin(), v.end(), 1);
// parallel sum of squares
long long sum_of_squares = std::transform_reduce(
std::execution::par,
v.begin(), v.end(),
0LL,
std::plus<>(),
{ return static_cast<long long>(x) * x; }
);
std::cout << "sum of squares: " << sum_of_squares << std::endl;
return 0;
}
```---
## 5. Parallel search
### find```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>
int main() {
std::vector<int> v(10000000);
std::iota(v.begin(), v.end(), 1);
// parallel find
auto it = std::find(std::execution::par, v.begin(), v.end(), 5000000);
if (it != v.end()) {
std::cout << "Find: " << *it << std::endl;
std::cout << "Index: " << std::distance(v.begin(), it) << std::endl;
}
return 0;
}
count_if
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>
int main() {
std::vector<int> v(10000000);
std::generate(v.begin(), v.end(), std::rand);
// parallel count_if(even count)
auto count = std::count_if(std::execution::par, v.begin(), v.end(),
{ return x % 2 == 0; });
std::cout << "Even number: " << count << std::endl;
return 0;
}
```---
## 6. Frequently occurring problems
### Problem 1: Small data```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <chrono>
#include <iostream>
void benchmarkSmallData() {
std::vector<int> small(100);
std::generate(small.begin(), small.end(), std::rand);
auto v1 = small;
auto v2 = small;
// sequential
auto start = std::chrono::high_resolution_clock::now();
std::sort(v1.begin(), v1.end());
auto end = std::chrono::high_resolution_clock::now();
auto seq_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
// parallel (overhead > gain)
start = std::chrono::high_resolution_clock::now();
std::sort(std::execution::par, v2.begin(), v2.end());
end = std::chrono::high_resolution_clock::now();
auto par_time = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
std::cout << "Sequence: " << seq_time.count() << "μs" << std::endl;
std::cout << "Parallel: " << par_time.count() << "μs" << std::endl;
}
int main() {
benchmarkSmallData();
// Sequence: 5μs
// Parallel: 150μs (overhead)
return 0;
}
```### Problem 2: Shared State (Data Race)```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>
int main() {
std::vector<int> v(1000000);
std::iota(v.begin(), v.end(), 1);
int sum = 0;
// ❌ Data Race
std::for_each(std::execution::par, v.begin(), v.end(), [&](int x) {
sum += x; // Multiple threads access sum simultaneously
});
std::cout << "Invalid sum: " << sum << std::endl; //Not as expected
// ✅ Use reduce
long long correct_sum = std::reduce(std::execution::par, v.begin(), v.end(), 0LL);
std::cout << "Correct sum: " << correct_sum << std::endl;
return 0;
}
```### Issue 3: Exception handling```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>
int main() {
std::vector<int> v = {1, 2, -3, 4, -5};
try {
// Exception during parallel execution
std::for_each(std::execution::par, v.begin(), v.end(), {
if (x < 0) {
throw std::runtime_error("Negative number found");
}
});
} catch (const std::exception& e) {
// Exception can occur in multiple threads
// Only the first exception is caught
std::cout << "Exception: " << e.what() << std::endl;
}
return 0;
}
```### Problem 4: Order-dependent operations```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <iostream>
int main() {
std::vector<int> v = {1, 2, 3, 4, 5};
// ❌ Order dependent (lace)
std::for_each(std::execution::par, v.begin(), v.end(), [&](int& x) {
x = v[0] + 1; // v[0] read by multiple threads (dangerous)
});
// ✅ Independent operations
std::for_each(std::execution::par, v.begin(), v.end(), {
x = x * 2; // Each element is independent
});
for (int x : v) std::cout << x << " ";
std::cout << std::endl;
return 0;
}
```---
## 7. Supported Algorithms
### Main Algorithm```cpp
#include <algorithm>
#include <numeric>
#include <execution>
// sort
std::sort(std::execution::par, v.begin(), v.end());
std::stable_sort(std::execution::par, v.begin(), v.end());
std::partial_sort(std::execution::par, v.begin(), v.begin() + 10, v.end());
// search
std::find(std::execution::par, v.begin(), v.end(), 42);
std::find_if(std::execution::par, v.begin(), v.end(), pred);
std::binary_search(std::execution::par, v.begin(), v.end(), 42);
// conversion
std::transform(std::execution::par, v.begin(), v.end(), v.begin(), func);
std::for_each(std::execution::par, v.begin(), v.end(), func);
// tally
std::reduce(std::execution::par, v.begin(), v.end(), 0);
std::transform_reduce(std::execution::par, v.begin(), v.end(), 0, plus, func);
// copy
std::copy(std::execution::par, v1.begin(), v1.end(), v2.begin());
std::copy_if(std::execution::par, v1.begin(), v1.end(), v2.begin(), pred);
// etc
std::count(std::execution::par, v.begin(), v.end(), 42);
std::count_if(std::execution::par, v.begin(), v.end(), pred);
std::all_of(std::execution::par, v.begin(), v.end(), pred);
std::any_of(std::execution::par, v.begin(), v.end(), pred);
```---
## 8. Practical example: image processing```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <cmath>
#include <iostream>
struct Pixel {
unsigned char r, g, b;
};
class ImageProcessor {
std::vector<Pixel> pixels;
int width, height;
public:
ImageProcessor(int w, int h) : width(w), height(h) {
pixels.resize(w * h);
}
// Parallel grayscale conversion
void toGrayscale() {
std::for_each(std::execution::par, pixels.begin(), pixels.end(),
{
unsigned char gray = static_cast<unsigned char>(
0.299 * p.r + 0.587 * p.g + 0.114 * p.b
);
p.r = p.g = p.b = gray;
});
}
// Parallel brightness adjustment
void adjustBrightness(int delta) {
std::for_each(std::execution::par, pixels.begin(), pixels.end(),
[delta](Pixel& p) {
auto clamp = {
return std::max(0, std::min(255, val));
};
p.r = clamp(p.r + delta);
p.g = clamp(p.g + delta);
p.b = clamp(p.b + delta);
});
}
// Parallel Blur (Simple Average)
void blur() {
auto original = pixels;
std::for_each(std::execution::par, pixels.begin(), pixels.end(),
[&](Pixel& p) {
int idx = &p - &pixels[0];
int x = idx % width;
int y = idx / width;
// 3x3 average (simplified)
int r_sum = 0, g_sum = 0, b_sum = 0, count = 0;
for (int dy = -1; dy <= 1; ++dy) {
for (int dx = -1; dx <= 1; ++dx) {
int nx = x + dx;
int ny = y + dy;
if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
int nidx = ny * width + nx;
r_sum += original[nidx].r;
g_sum += original[nidx].g;
b_sum += original[nidx].b;
++count;
}
}
}
p.r = r_sum / count;
p.g = g_sum / count;
p.b = b_sum / count;
});
}
// parallel pixel count
int countBrightPixels(int threshold) {
return std::count_if(std::execution::par, pixels.begin(), pixels.end(),
[threshold](const Pixel& p) {
int brightness = (p.r + p.g + p.b) / 3;
return brightness > threshold;
});
}
};
int main() {
ImageProcessor img(1920, 1080);
img.toGrayscale();
img.adjustBrightness(20);
img.blur();
int bright = img.countBrightPixels(128);
std::cout << "Bright pixel: " << bright << std::endl;
return 0;
}
```---
## Cleanup
### Key takeaways
1. **Parallel algorithm**: C++17 multi-core utilization
2. **Execution Policy**: seq, par, par_unseq
3. **Performance**: 2-4x improvement on large data
4. **Caution**: No data race or order dependency.
5. **Application**: More than 10,000, independent calculations
### Parallel algorithm effects
| data size | computational complexity | parallel effect | Recommended Policy |
|-----------|-----------|---------|---------|
| < 1,000 | low | None | seq |
| 1,000-10,000 | low | low | seq |
| > 10,000 | low | middle | par |
| > 10,000 | High | High | par |
| > 100,000 | Simple | very high | par_unseq |
### Practical tips
**Principle of use:**
- Only used for large data (more than 10,000 items)
- Effective for computationally intensive operations
- Parallelize only independent operations
- Avoid shared states
**Performance:**
- Measure effectiveness through profiling
- Small data has large overhead
- Memory-intensive operations are less effective
- Effects vary depending on the number of CPU cores
**Caution:**
- Data race prevention (`std::atomic`, `reduce`)
- Careful handling of exceptions (possible `std::terminate`)
- Prohibit order-dependent operations
- Difficult to debug (using TSan)
### Next steps
- [C++ Execution Policy](/blog/cpp-execution-policy/)
- C++ Thread
- C++ Atomic
---
## Related articles
- [C++ Algorithm Sort | ](/blog/cpp-algorithm-sort/)
- [Arrays and lists | Complete summary of essential data structures for coding tests](/blog/algorithm-series-01-array-list/)