Parallel for_each more than two times slower than std::for_each

Question

I am reading C++ Concurrency in Action by Anthony Williams. The chapter on designing concurrent code contains a parallel version of the std::for_each algorithm. Here is the code from the book, slightly modified:

join_threads.hpp

#pragma once

#include <vector>
#include <thread>

class join_threads
{
public:
  explicit join_threads(std::vector<std::thread>& threads)
    : threads_(threads) {}

  ~join_threads()
  {
    for (size_t i = 0; i < threads_.size(); ++i)
    {
      if(threads_[i].joinable())
      {
        threads_[i].join();
      }
    }
  }

private:
  std::vector<std::thread>& threads_;
};

parallel_for_each.hpp

#pragma once

#include <algorithm>
#include <future>
#include <iterator>
#include <thread>
#include <vector>

#include "join_threads.hpp"

template<typename Iterator, typename Func>
void parallel_for_each(Iterator first, Iterator last, Func func)
{
  const auto length = std::distance(first, last);
  if (0 == length) return;

  const auto min_per_thread = 25u;
  const unsigned max_threads = (length + min_per_thread - 1) / min_per_thread;

  const auto hardware_threads = std::thread::hardware_concurrency();

  const auto num_threads= std::min(hardware_threads != 0 ?
        hardware_threads : 2u, max_threads);

  const auto block_size = length / num_threads;

  std::vector<std::future<void>> futures(num_threads - 1);
  std::vector<std::thread> threads(num_threads-1);
  join_threads joiner(threads);

  auto block_start = first;
  for (unsigned i = 0; i < num_threads - 1; ++i)
  {
    auto block_end = block_start;
    std::advance(block_end, block_size);
    std::packaged_task<void (void)> task([block_start, block_end, func]()
    {
      std::for_each(block_start, block_end, func);
    });
    futures[i] = task.get_future();
    threads[i] = std::thread(std::move(task));
    block_start = block_end;
  }

  std::for_each(block_start, last, func);

  for (size_t i = 0; i < num_threads - 1; ++i)
  {
    futures[i].get();
  }
}

I benchmarked it against the sequential std::for_each using the following program:

main.cpp

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <random>

#include "parallel_for_each.hpp"

using namespace std;

constexpr size_t ARRAY_SIZE = 500'000'000;
typedef std::vector<uint64_t> Array;

template <class FE, class F>
void test_for_each(const Array& a, FE fe, F f, atomic<uint64_t>& result)
{
  auto time_begin = chrono::high_resolution_clock::now();
  result = 0;
  fe(a.begin(), a.end(), f);
  auto time_end = chrono::high_resolution_clock::now();

  cout << "Result = " << result << endl;
  cout << "Time: " << chrono::duration_cast<chrono::milliseconds>(
            time_end - time_begin).count() << endl;
}

int main()
{
  random_device device;
  default_random_engine engine(device());
  // uint8_t is not a permitted IntType for uniform_int_distribution, so use unsigned.
  uniform_int_distribution<unsigned> distribution(0, 255);

  Array a;
  a.reserve(ARRAY_SIZE);

  cout << "Generating array ... " << endl;
  for (size_t i = 0; i < ARRAY_SIZE; ++i)
    a.push_back(distribution(engine));

  atomic<uint64_t> result;
  auto acc = [&result](uint64_t value) { result += value; };

  cout << "parallel_for_each ..." << endl;
  test_for_each(a, parallel_for_each<Array::const_iterator, decltype(acc)>, acc, result);
  cout << "for_each ..." << endl;
  test_for_each(a, for_each<Array::const_iterator, decltype(acc)>, acc, result);

  return 0;
}

On my machine the parallel version of the algorithm is more than two times slower than the sequential one:

parallel_for_each ...
Result = 63750301073
Time: 5448
for_each ...
Result = 63750301073
Time: 2496

I am using the GCC 6.2 compiler on Ubuntu Linux, running on an Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz.

How can this behaviour be explained? Is it because the atomic<uint64_t> variable is shared between the threads, causing cache ping-pong?

I profiled both versions separately with perf. For the parallel version the stats are the following:

 1137982167      cache-references                                            
  247652893      cache-misses              #   21,762 % of all cache refs    
60868183996      cycles                                                      
27409239189      instructions              #    0,45  insns per cycle        
 3287117194      branches                                                    
      80895      faults                                                      
          4      migrations

For the sequential one:

  402791485      cache-references                                            
  246561299      cache-misses              #   61,213 % of all cache refs    
40284812779      cycles                                                      
26515783790      instructions              #    0,66  insns per cycle
 3188784664      branches                                                    
      48179      faults
          3      migrations

Clearly the parallel version generates far more cache references, cycles and faults, but why?

Answer 1

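You are sharing the same result variable: all the threads accumulate into atomic<uint64_t> result, which thrashes the cache!

Every time a thread writes to result, the copies of that cache line held by the other cores are invalidated: this leads to cache line contention.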
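To make the contention concrete, here is a minimal sketch of the two accumulation patterns. The helper names (run_on_slices, sum_shared_atomic, sum_local_then_publish) are hypothetical and not taken from the question or the book. The first function does one atomic read-modify-write per element, as the question's lambda does; the second touches the shared variable once per thread. Both return the same sum; how large the timing gap is will depend on the machine.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: runs worker(begin, end) on num_threads threads,
// each over its own contiguous slice of the index range [0, n).
template <typename Worker>
void run_on_slices(std::size_t n, unsigned num_threads, Worker worker)
{
  assert(num_threads > 0);
  std::vector<std::thread> threads;
  const std::size_t chunk = n / num_threads;
  for (unsigned t = 0; t < num_threads; ++t)
  {
    const std::size_t begin = t * chunk;
    // The last thread also takes the remainder.
    const std::size_t end = (t + 1 == num_threads) ? n : begin + chunk;
    threads.emplace_back(worker, begin, end);
  }
  for (auto& th : threads) th.join();
}

// Pattern from the question: every element causes an atomic
// read-modify-write on one variable shared by all threads.
std::uint64_t sum_shared_atomic(const std::vector<std::uint64_t>& data,
                                unsigned num_threads)
{
  std::atomic<std::uint64_t> total{0};
  run_on_slices(data.size(), num_threads,
                [&](std::size_t begin, std::size_t end)
  {
    for (std::size_t i = begin; i < end; ++i) total += data[i];
  });
  return total.load();
}

// Alternative: each thread sums into a private local variable and
// updates the shared atomic exactly once.
std::uint64_t sum_local_then_publish(const std::vector<std::uint64_t>& data,
                                     unsigned num_threads)
{
  std::atomic<std::uint64_t> total{0};
  run_on_slices(data.size(), num_threads,
                [&](std::size_t begin, std::size_t end)
  {
    std::uint64_t local = 0;
    for (std::size_t i = begin; i < end; ++i) local += data[i];
    total += local;
  });
  return total.load();
}
```

Timing both functions on a large vector (for example with std::chrono, as in the question's main.cpp) is one way to see the effect on a given machine.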

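More information:

"Sharing is the root of all contention".

To write to a memory location, a core must additionally have exclusive ownership of the cache line containing that location. While one core has exclusive use, all other cores trying to write to the same memory location must wait and take turns; that is, they must run serially. Conceptually, it is as if each cache line were protected by a hardware mutex, where only one core can hold the hardware lock on that cache line at a time.
This article on "false sharing" covers a similar issue and explains in more depth what happens in the caches.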
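False sharing is a related but distinct problem: the threads write to different variables that happen to sit on the same cache line. The sketch below is an illustration under stated assumptions, not code from the answer. The 64-byte line size is an assumption (typical for current x86-64 CPUs), and the names PackedCounter, PaddedCounter and count_in_parallel are hypothetical. The volatile qualifier is there only to stop the compiler from collapsing the loop into a single addition.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>

// Assumption: 64-byte cache lines, which is typical for x86-64.
constexpr std::size_t kAssumedCacheLine = 64;

// Adjacent PackedCounter objects in an array usually share a cache line.
struct PackedCounter
{
  volatile std::uint64_t value = 0;
};

// alignas pads each PaddedCounter so that it occupies its own cache line.
struct alignas(kAssumedCacheLine) PaddedCounter
{
  volatile std::uint64_t value = 0;
};

// Two threads each increment their own counter. With PackedCounter the
// two counters are neighbours in memory; with PaddedCounter they are not.
template <typename Counter>
std::uint64_t count_in_parallel(std::uint64_t iterations)
{
  Counter counters[2];
  auto work = [iterations](Counter& c)
  {
    for (std::uint64_t i = 0; i < iterations; ++i)
      c.value = c.value + 1;
  };
  std::thread t0(work, std::ref(counters[0]));
  std::thread t1(work, std::ref(counters[1]));
  t0.join();
  t1.join();
  return counters[0].value + counters[1].value;
}
```

Both instantiations compute the same total, since no variable is actually shared. Only the memory layout differs, so any timing difference between count_in_parallel<PackedCounter> and count_in_parallel<PaddedCounter> on a given machine comes from the cache-line traffic.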

I made some modifications to your program and obtained the following results (on a machine with an i7-4770K [8 threads with hyper-threading]):

Generating array ...
parallel_for_each ...
Result = 63748111806
Time: 195
for_each ...
Result = 63748111806
Time: 2727

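The parallel version takes roughly 93% less time than the serial version (195 ms versus 2727 ms, about 14 times faster).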

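std::future and std::packaged_task are heavyweight abstractions. In this case, a std::experimental::latch is sufficient.
Every task is sent to a thread pool. This minimizes thread creation overhead.
Every task has its own accumulator. This eliminates sharing.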

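The code is available on my GitHub. It uses some personal dependencies, but you should be able to follow the changes regardless.

Here are the most important changes: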

// A latch is being used instead of a vector of futures.
ecst::latch l(num_threads - 1);

l.execute_and_wait_until_zero([&]
{
    auto block_start = first;
    for (unsigned i = 0; i < num_threads - 1; ++i)
    {
        auto block_end = block_start;
        std::advance(block_end, block_size);

        // `p` is a thread pool.
        // Every task posted in the thread pool has its own `tempacc` accumulator.
        p.post([&, block_start, block_end, tempacc = 0ull]() mutable
        {
            // The task accumulator is filled up...
            std::for_each(block_start, block_end, [&tempacc](auto x){ tempacc += x; });

            // ...and then the atomic variable is incremented ONCE.
            func(tempacc);
            l.decrement_and_notify_all();
        });

        block_start = block_end;
    }

    // Same idea here: accumulate to local non-atomic counter, then
    // add the partial result to the atomic counter ONCE.
    auto tempacc2 = 0ull;
    std::for_each(block_start, last, [&tempacc2](auto x){ tempacc2 += x; });
    func(tempacc2);
});
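For readers without the ecst library, the per-task accumulator idea can be sketched with the standard library alone. This is an assumption-laden illustration rather than the answer's actual implementation: parallel_sum is a hypothetical name, plain std::thread objects stand in for the thread pool, and join() stands in for the latch. It also omits the join_threads guard from the question, so it is not exception-safe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical standard-library-only variant: sums [first, last) using the
// same block partitioning as the question's parallel_for_each.
template <typename Iterator>
std::uint64_t parallel_sum(Iterator first, Iterator last)
{
  const auto length = std::distance(first, last);
  if (length <= 0) return 0;

  const std::size_t min_per_thread = 25;
  const std::size_t max_threads =
      (static_cast<std::size_t>(length) + min_per_thread - 1) / min_per_thread;
  const std::size_t hardware_threads = std::thread::hardware_concurrency();
  const std::size_t num_threads = std::min<std::size_t>(
      hardware_threads != 0 ? hardware_threads : 2, max_threads);
  const auto block_size = length / static_cast<decltype(length)>(num_threads);

  std::vector<std::uint64_t> partial(num_threads, 0);  // one slot per block
  std::vector<std::thread> threads;
  threads.reserve(num_threads - 1);

  Iterator block_start = first;
  for (std::size_t i = 0; i + 1 < num_threads; ++i)
  {
    Iterator block_end = block_start;
    std::advance(block_end, block_size);
    threads.emplace_back([block_start, block_end, &partial, i]
    {
      // Accumulate privately, then write the shared slot exactly once.
      std::uint64_t local = 0;
      for (Iterator it = block_start; it != block_end; ++it) local += *it;
      partial[i] = local;
    });
    block_start = block_end;
  }

  // The calling thread processes the final block, remainder included.
  std::uint64_t local = 0;
  for (Iterator it = block_start; it != last; ++it) local += *it;
  partial[num_threads - 1] = local;

  for (auto& t : threads) t.join();
  return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

Because each thread writes its slot in partial only once, the slots being adjacent in memory matters little here. Creating threads on every call is the overhead that the thread pool in the answer avoids.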
