Parallel for_each more than two times slower than std::for_each
I'm reading C++ Concurrency in Action by Anthony Williams. The chapter on designing concurrent code contains a parallel version of the std::for_each algorithm. Here is the code from the book, slightly modified:
join_threads.hpp
#pragma once
#include <cstddef>
#include <thread>
#include <vector>
class join_threads
{
public:
    explicit join_threads(std::vector<std::thread>& threads)
        : threads_(threads) {}
    // Joins every thread that is still joinable when the guard goes out of scope.
    ~join_threads()
    {
        for (std::size_t i = 0; i < threads_.size(); ++i)
        {
            if (threads_[i].joinable())
            {
                threads_[i].join();
            }
        }
    }
private:
    std::vector<std::thread>& threads_;
};
parallel_for_each.hpp
#pragma once
#include <algorithm>
#include <future>
#include <iterator>
#include <thread>
#include <vector>
#include "join_threads.hpp"
template<typename Iterator, typename Func>
void parallel_for_each(Iterator first, Iterator last, Func func)
{
    const auto length = std::distance(first, last);
    if (0 == length) return;
    const auto min_per_thread = 25u;
    const unsigned max_threads = (length + min_per_thread - 1) / min_per_thread;
    const auto hardware_threads = std::thread::hardware_concurrency();
    const auto num_threads = std::min(hardware_threads != 0 ?
        hardware_threads : 2u, max_threads);
    const auto block_size = length / num_threads;
    std::vector<std::future<void>> futures(num_threads - 1);
    std::vector<std::thread> threads(num_threads - 1);
    join_threads joiner(threads);
    auto block_start = first;
    // Hand one block to each worker thread...
    for (unsigned i = 0; i < num_threads - 1; ++i)
    {
        auto block_end = block_start;
        std::advance(block_end, block_size);
        std::packaged_task<void (void)> task([block_start, block_end, func]()
        {
            std::for_each(block_start, block_end, func);
        });
        futures[i] = task.get_future();
        threads[i] = std::thread(std::move(task));
        block_start = block_end;
    }
    // ...and process the last block on the calling thread.
    std::for_each(block_start, last, func);
    for (size_t i = 0; i < num_threads - 1; ++i)
    {
        futures[i].get();
    }
}
I benchmarked it against the sequential std::for_each using the following program:
main.cpp
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>
#include "parallel_for_each.hpp"
using namespace std;
constexpr size_t ARRAY_SIZE = 500'000'000;
typedef std::vector<uint64_t> Array;
template <class FE, class F>
void test_for_each(const Array& a, FE fe, F f, atomic<uint64_t>& result)
{
    auto time_begin = chrono::high_resolution_clock::now();
    result = 0;
    fe(a.begin(), a.end(), f);
    auto time_end = chrono::high_resolution_clock::now();
    cout << "Result = " << result << endl;
    cout << "Time: " << chrono::duration_cast<chrono::milliseconds>(
        time_end - time_begin).count() << endl;
}
int main()
{
    random_device device;
    default_random_engine engine(device());
    // Note: the standard does not list uint8_t as a valid IntType here; libstdc++ accepts it.
    uniform_int_distribution<uint8_t> distribution(0, 255);
    Array a;
    a.reserve(ARRAY_SIZE);
    cout << "Generating array ... " << endl;
    for (size_t i = 0; i < ARRAY_SIZE; ++i)
        a.push_back(distribution(engine));
    atomic<uint64_t> result;
    // Every element is added directly to the shared atomic.
    auto acc = [&result](uint64_t value) { result += value; };
    cout << "parallel_for_each ..." << endl;
    test_for_each(a, parallel_for_each<Array::const_iterator, decltype(acc)>, acc, result);
    cout << "for_each ..." << endl;
    test_for_each(a, for_each<Array::const_iterator, decltype(acc)>, acc, result);
    return 0;
}
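On my machine the parallel version of the algorithm is more than two times slower than the sequential one (times are in milliseconds):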
parallel_for_each ...
Result = 63750301073
Time: 5448
for_each ...
Result = 63750301073
Time: 2496
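I'm using the GCC 6.2 compiler on Ubuntu Linux, running on an Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz.
How can this behavior be explained? Is it because the atomic<uint64_t> variable is shared between the threads, causing cache ping-pong?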
I profiled both versions separately with perf. For the parallel version the stats are the following:
1137982167 cache-references
247652893 cache-misses # 21,762 % of all cache refs
60868183996 cycles
27409239189 instructions # 0,45 insns per cycle
3287117194 branches
80895 faults
4 migrations
And for the sequential one:
402791485 cache-references
246561299 cache-misses # 61,213 % of all cache refs
40284812779 cycles
26515783790 instructions # 0,66 insns per cycle
3188784664 branches
48179 faults
3 migrations
Obviously, the parallel version generates far more cache references, cycles and faults, but why?
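You are sharing the same result variable: all the threads are accumulating into atomic<uint64_t> result, thrashing the cache!
Every time a thread writes to result, the copies of that cache line held by all the other cores are invalidated: this leads to cache line contention.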
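More information:
"In order to write to a memory location, a core must additionally have exclusive ownership of the cache line containing that location. While one core has exclusive use, all other cores trying to write to the same memory location must wait and take turns, that is, they must run serially. Conceptually, it is as if each cache line were protected by a hardware mutex, where only one core can hold the hardware lock on that cache line at a time."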
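To make the difference concrete, here is a minimal sketch (not the code from the question or from my repository) that contrasts the two accumulation styles using only the standard library. The names `run_in_blocks`, `sum_contended` and `sum_local` are made up for this illustration, and the sketch assumes `num_threads` is greater than zero. Both functions return the same sum; the second one simply touches the shared cache line once per thread instead of once per element.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: splits [0, size) into num_threads blocks and runs
// body(begin, end) for each block on its own thread, then joins them all.
template <typename Body>
void run_in_blocks(std::size_t size, unsigned num_threads, Body body)
{
    std::vector<std::thread> threads;
    const std::size_t block = size / num_threads;
    for (unsigned t = 0; t < num_threads; ++t)
    {
        const std::size_t begin = t * block;
        // The last block also takes the remainder.
        const std::size_t end = (t + 1 == num_threads) ? size : begin + block;
        threads.emplace_back(body, begin, end);
    }
    for (auto& th : threads) th.join();
}

// Every element is added straight to the shared atomic, so each addition
// needs exclusive ownership of the same cache line.
std::uint64_t sum_contended(const std::vector<std::uint64_t>& data, unsigned num_threads)
{
    std::atomic<std::uint64_t> result{0};
    run_in_blocks(data.size(), num_threads, [&](std::size_t begin, std::size_t end)
    {
        for (std::size_t i = begin; i < end; ++i)
            result += data[i];
    });
    return result.load();
}

// Each thread sums into a private local variable and publishes it ONCE.
std::uint64_t sum_local(const std::vector<std::uint64_t>& data, unsigned num_threads)
{
    std::atomic<std::uint64_t> result{0};
    run_in_blocks(data.size(), num_threads, [&](std::size_t begin, std::size_t end)
    {
        std::uint64_t local = 0;
        for (std::size_t i = begin; i < end; ++i)
            local += data[i];
        result += local;
    });
    return result.load();
}
```

Calling `sum_contended(v, 4)` and `sum_local(v, 4)` on the same vector should give identical results; how much faster the second is depends on the machine, so no timings are claimed here.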
This article about "false sharing", which covers a similar issue, explains in more depth what happens in the caches.
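False sharing is a slightly different situation from the one in the question: the threads write to different variables that happen to sit on the same cache line. A common mitigation is to pad each thread's data to its own cache line. The sketch below illustrates that idea only; `kCacheLine`, `kThreads`, `PaddedCounter` and `count_with_padding` are names invented for this example, and the 64-byte line size is an assumption that holds on typical x86 CPUs but should be checked for the target hardware.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines
constexpr unsigned kThreads = 4;        // illustrative, fixed thread count

// alignas forces every counter onto its own cache line, so threads that
// update neighbouring counters do not invalidate each other's line.
struct alignas(kCacheLine) PaddedCounter
{
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Each thread increments only its own padded counter; the partial counts
// are combined after the threads have been joined.
std::uint64_t count_with_padding(std::uint64_t increments_per_thread)
{
    std::array<PaddedCounter, kThreads> counters{};
    std::array<std::thread, kThreads> threads;
    for (unsigned t = 0; t < kThreads; ++t)
    {
        threads[t] = std::thread([&counters, t, increments_per_thread]
        {
            for (std::uint64_t i = 0; i < increments_per_thread; ++i)
                ++counters[t].value;
        });
    }
    std::uint64_t total = 0;
    for (unsigned t = 0; t < kThreads; ++t)
    {
        threads[t].join();            // join before reading the counter
        total += counters[t].value;
    }
    return total;
}
```

Removing the `alignas` would pack the four counters into a single 64-byte line and reintroduce the false sharing, while the result would stay the same.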
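I made some modifications to your program and achieved the following results (on a machine with an i7-4770K, which has 4 cores and 8 hardware threads with hyper-threading):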
Generating array ...
parallel_for_each ...
Result = 63748111806
Time: 195
for_each ...
Result = 63748111806
Time: 2727
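The parallel version takes roughly 93% less time than the serial version (195 ms versus 2727 ms, about 14 times faster). These are the changes I made: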
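std::future and std::packaged_task are heavyweight abstractions. In this case, an std::experimental::latch is sufficient.
Every task is sent to a thread pool. This minimizes thread creation overhead.
Every task has its own accumulator. This eliminates sharing.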
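If you want to try the latch and per-task accumulator ideas without my dependencies, the following is a rough standard-library-only sketch. It is not the code from my repository: `simple_latch` is a hand-rolled stand-in written for this example (it is neither `ecst::latch` nor `std::experimental::latch`), `parallel_sum` is a hypothetical name, and plain `std::thread` objects are used instead of a thread pool, so the thread creation overhead is still there.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical minimal latch: wait() blocks until count_down() has been
// called `count` times.
class simple_latch
{
public:
    explicit simple_latch(std::size_t count) : count_(count) {}

    void count_down()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (--count_ == 0) cv_.notify_all();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

std::uint64_t parallel_sum(const std::vector<std::uint64_t>& data, unsigned num_threads)
{
    if (data.empty() || num_threads == 0) return 0;

    std::atomic<std::uint64_t> result{0};
    simple_latch done(num_threads - 1);
    std::vector<std::thread> workers;
    const std::size_t block = data.size() / num_threads;

    for (unsigned t = 0; t + 1 < num_threads; ++t)
    {
        workers.emplace_back([&, t]
        {
            // Private accumulator: no sharing while the block is summed.
            std::uint64_t local = 0;
            for (std::size_t i = t * block; i < (t + 1) * block; ++i)
                local += data[i];
            result += local;      // the atomic is touched ONCE per task
            done.count_down();
        });
    }

    // The calling thread handles the last block, including the remainder.
    std::uint64_t local = 0;
    for (std::size_t i = (num_threads - 1) * block; i < data.size(); ++i)
        local += data[i];
    result += local;

    done.wait();                          // all worker tasks have published
    for (auto& w : workers) w.join();     // join before the latch is destroyed
    return result.load();
}
```

With plain threads the final `join` would be enough on its own; the latch is kept to show the synchronization that is needed once the tasks run on pooled threads that are never joined.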
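The code is available on my GitHub. It uses some personal dependencies, but you should be able to understand the changes regardless.
Here are the most important changes: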
// A latch is being used instead of a vector of futures.
ecst::latch l(num_threads - 1);
l.execute_and_wait_until_zero([&]
{
    auto block_start = first;
    for (unsigned i = 0; i < num_threads - 1; ++i)
    {
        auto block_end = block_start;
        std::advance(block_end, block_size);
        // `p` is a thread pool.
        // Every task posted in the thread pool has its own `tempacc` accumulator.
        p.post([&, block_start, block_end, tempacc = 0ull]() mutable
        {
            // The task accumulator is filled up...
            std::for_each(block_start, block_end, [&tempacc](auto x){ tempacc += x; });
            // ...and then the atomic variable is incremented ONCE.
            func(tempacc);
            l.decrement_and_notify_all();
        });
        block_start = block_end;
    }
    // Same idea here: accumulate to local non-atomic counter, then
    // add the partial result to the atomic counter ONCE.
    auto tempacc2 = 0ull;
    std::for_each(block_start, last, [&tempacc2](auto x){ tempacc2 += x; });
    func(tempacc2);
});