
Parallel for_each more than two times slower than std::for_each

I'm reading C++ Concurrency in Action by Anthony Williams. In the chapter about designing concurrent code there is a parallel version of the std::for_each algorithm. Here is slightly modified code from the book:

join_threads.hpp

#pragma once

#include <vector>
#include <thread>

class join_threads
{
public:
  explicit join_threads(std::vector<std::thread>& threads)
    : threads_(threads) {}

  ~join_threads()
  {
    for (size_t i = 0; i < threads_.size(); ++i)
    {
      if(threads_[i].joinable())
      {
        threads_[i].join();
      }
    }
  }

private:
  std::vector<std::thread>& threads_;
};

parallel_for_each.hpp

#pragma once

#include <future>
#include <algorithm>
#include <iterator>
#include <vector>

#include "join_threads.hpp"

template<typename Iterator, typename Func>
void parallel_for_each(Iterator first, Iterator last, Func func)
{
  const auto length = std::distance(first, last);
  if (0 == length) return;

  const auto min_per_thread = 25u;
  const unsigned max_threads = (length + min_per_thread - 1) / min_per_thread;

  const auto hardware_threads = std::thread::hardware_concurrency();

  const auto num_threads = std::min(hardware_threads != 0 ?
        hardware_threads : 2u, max_threads);

  const auto block_size = length / num_threads;

  std::vector<std::future<void>> futures(num_threads - 1);
  std::vector<std::thread> threads(num_threads-1);
  join_threads joiner(threads);

  auto block_start = first;
  for (unsigned i = 0; i < num_threads - 1; ++i)
  {
    auto block_end = block_start;
    std::advance(block_end, block_size);
    std::packaged_task<void (void)> task([block_start, block_end, func]()
    {
      std::for_each(block_start, block_end, func);
    });
    futures[i] = task.get_future();
    threads[i] = std::thread(std::move(task));
    block_start = block_end;
  }

  std::for_each(block_start, last, func);

  for (size_t i = 0; i < num_threads - 1; ++i)
  {
    futures[i].get();
  }
}
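As a concrete example of how the work gets split: for the 500'000'000-element array in the benchmark below, max_threads = (500'000'000 + 25 - 1) / 25 = 20'000'000, far above the hardware limit. Assuming std::thread::hardware_concurrency() reports 4 (the i3-6100 mentioned later is a 2-core/4-thread CPU), num_threads = 4 and block_size = 125'000'000, so three worker threads plus the calling thread each process a quarter of the range.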

I benchmarked it against the sequential std::for_each using the following program:

main.cpp

#include <iostream>
#include <random>
#include <chrono>
#include <atomic>
#include <cstdint>
#include <vector>

#include "parallel_for_each.hpp"

using namespace std;

constexpr size_t ARRAY_SIZE = 500'000'000;
typedef std::vector<uint64_t> Array;

template <class FE, class F>
void test_for_each(const Array& a, FE fe, F f, atomic<uint64_t>& result)
{
  auto time_begin = chrono::high_resolution_clock::now();
  result = 0;
  fe(a.begin(), a.end(), f);
  auto time_end = chrono::high_resolution_clock::now();

  cout << "Result = " << result << endl;
  cout << "Time: " << chrono::duration_cast<chrono::milliseconds>(
            time_end - time_begin).count() << endl;
}

int main()
{
  random_device device;
  default_random_engine engine(device());
  // Note: narrow character types such as uint8_t are not valid result types
  // for uniform_int_distribution, so int is used here instead.
  uniform_int_distribution<int> distribution(0, 255);

  Array a;
  a.reserve(ARRAY_SIZE);

  cout << "Generating array ... " << endl;
  for (size_t i = 0; i < ARRAY_SIZE; ++i)
    a.push_back(distribution(engine));

  atomic<uint64_t> result;
  auto acc = [&result](uint64_t value) { result += value; };

  cout << "parallel_for_each ..." << endl;
  test_for_each(a, parallel_for_each<Array::const_iterator, decltype(acc)>, acc, result);
  cout << "for_each ..." << endl;
  test_for_each(a, for_each<Array::const_iterator, decltype(acc)>, acc, result);

  return 0;
}

The parallel version of the algorithm on my machine is more than two times slower than the sequential one:

parallel_for_each ...
Result = 63750301073
Time: 5448
for_each ...
Result = 63750301073
Time: 2496

I'm using the GCC 6.2 compiler on Ubuntu Linux, running on an Intel(R) Core(TM) i3-6100 CPU @ 3.70GHz.

How can such behavior be explained? Is it because the atomic<uint64_t> variable is shared between threads, causing cache ping-pong?

I profiled both separately with perf. For the parallel version the stats are the following:

 1137982167      cache-references                                            
  247652893      cache-misses              #   21,762 % of all cache refs    
60868183996      cycles                                                      
27409239189      instructions              #    0,45  insns per cycle        
 3287117194      branches                                                    
      80895      faults                                                      
          4      migrations

And for the sequential one:

  402791485      cache-references                                            
  246561299      cache-misses              #   61,213 % of all cache refs    
40284812779      cycles                                                      
26515783790      instructions              #    0,66  insns per cycle
 3188784664      branches                                                    
      48179      faults
          3      migrations

It is obvious that the parallel version generates far more cache references, cycles and faults, but why?

You are sharing the same result variable: all the threads are accumulating on atomic<uint64_t> result, thrashing the cache!

Every time a thread writes to result, all the caches in the other cores are invalidated: this leads to cache line contention.
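The effect is easy to reproduce in isolation. Below is a minimal, self-contained sketch (names, sizes and iteration counts are made up for illustration, not taken from the question): two threads summing halves of an array directly into one shared atomic, versus the same two threads summing into local variables and touching the atomic once each.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Every element goes through the shared atomic: the cache line holding
// `shared` bounces between the two cores on every single write.
uint64_t contended(const std::vector<uint64_t>& data)
{
  std::atomic<uint64_t> shared{0};
  const auto mid = data.begin() + data.size() / 2;
  std::thread t1([&] { for (auto it = data.begin(); it != mid; ++it) shared += *it; });
  std::thread t2([&] { for (auto it = mid; it != data.end(); ++it) shared += *it; });
  t1.join();
  t2.join();
  return shared;
}

// Each thread sums its half into a local value and touches the atomic once.
uint64_t local_then_merge(const std::vector<uint64_t>& data)
{
  std::atomic<uint64_t> shared{0};
  const auto mid = data.begin() + data.size() / 2;
  std::thread t1([&] { shared += std::accumulate(data.begin(), mid, uint64_t{0}); });
  std::thread t2([&] { shared += std::accumulate(mid, data.end(), uint64_t{0}); });
  t1.join();
  t2.join();
  return shared;
}

int main()
{
  const std::vector<uint64_t> data(50'000'000, 1);  // arbitrary size

  auto time_it = [&](const char* name, auto fn)
  {
    const auto t0 = std::chrono::steady_clock::now();
    const auto sum = fn(data);
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << name << ": sum = " << sum << ", "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms" << std::endl;
  };

  time_it("contended atomic  ", contended);
  time_it("local accumulators", local_then_merge);
}

The second variant is essentially what the modified code shown further down does inside each task.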

More information:

  • "Sharing Is the Root of All Contention" . “共享是所有争论的根源”

    [...] to write to a memory location a core must additionally have exclusive ownership of the cache line containing that location. While one core has exclusive use, all other cores trying to write the same memory location must wait and take turns — that is, they must run serially. Conceptually, it's as if each cache line were protected by a hardware mutex, where only one core can hold the hardware lock on that cache line at a time.

  • This article on "false sharing", which covers a similar issue, explains more in depth what happens in the caches.


I made some modifications to your program and achieved the following results (on a machine with an i7-4770K [8 threads + hyperthreading]):

Generating array ...
parallel_for_each ...
Result = 63748111806
Time: 195
for_each ...
Result = 63748111806
Time: 2727

The parallel version is roughly 92% faster than the serial version.


  1. std::future and std::packaged_task are heavyweight abstractions. In this case, an std::experimental::latch is sufficient.

  2. Every task is sent to a thread pool. This minimizes thread creation overhead.

  3. Every task has its own accumulator. This eliminates sharing.

The code is available here on my GitHub. It uses some personal dependencies, but you should understand the changes regardless.


Here are the most important changes:

// A latch is being used instead of a vector of futures.
ecst::latch l(num_threads - 1);

l.execute_and_wait_until_zero([&]
{
    auto block_start = first;
    for (unsigned i = 0; i < num_threads - 1; ++i)
    {
        auto block_end = block_start;
        std::advance(block_end, block_size);

        // `p` is a thread pool.
        // Every task posted in the thread pool has its own `tempacc` accumulator.
        p.post([&, block_start, block_end, tempacc = 0ull]() mutable
        {
            // The task accumulator is filled up...
            std::for_each(block_start, block_end, [&tempacc](auto x){ tempacc += x; });

            // ...and then the atomic variable is incremented ONCE.
            func(tempacc);
            l.decrement_and_notify_all();
        });

        block_start = block_end;
    }

    // Same idea here: accumulate to local non-atomic counter, then
    // add the partial result to the atomic counter ONCE.
    auto tempacc2 = 0ull;
    std::for_each(block_start, last, [&tempacc2](auto x){ tempacc2 += x; });
    func(tempacc2);
});
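For comparison, here is a rough sketch of the same per-block accumulation using only the standard library (plain std::thread, no thread pool and no latch, so it does not address thread-creation overhead; the function and parameter names are my own, not part of the answer's code):

#include <atomic>
#include <cstdint>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

// Assumes num_threads >= 1.
template <typename Iterator>
uint64_t parallel_sum(Iterator first, Iterator last, unsigned num_threads)
{
  const auto length = std::distance(first, last);
  const auto block_size = length / num_threads;

  std::atomic<uint64_t> result{0};
  std::vector<std::thread> threads;
  threads.reserve(num_threads - 1);

  auto block_start = first;
  for (unsigned i = 0; i + 1 < num_threads; ++i)
  {
    const auto block_end = std::next(block_start, block_size);
    threads.emplace_back([block_start, block_end, &result]
    {
      // Per-thread local accumulator: no writes to shared state inside the loop.
      const uint64_t partial = std::accumulate(block_start, block_end, uint64_t{0});
      result += partial;  // exactly one atomic update per worker
    });
    block_start = block_end;
  }

  // The calling thread handles the final block, as in the original code.
  result += std::accumulate(block_start, last, uint64_t{0});

  for (auto& t : threads)
    t.join();

  return result;
}

Calling it on the question's 500-million-element vector with num_threads = std::thread::hardware_concurrency() applies the key change, one atomic update per thread instead of one per element, without any extra dependencies.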
