
Why would a parallel version of accumulate be so much slower?

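Inspired by Anthony Williams' "C++ Concurrency in Action", I took a closer look at his parallel version of std::accumulate. I copied the code from the book and added some output for debugging purposes, and this is what I ended up with: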

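#include <algorithm>
#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <iterator>
#include <numeric>
#include <random>
#include <thread>
#include <utility>
#include <vector>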

template <typename Iterator, typename T>
struct accumulate_block
{
  T operator()(Iterator first, Iterator last)
  {
    return std::accumulate(first, last, T());
  }
};

template <typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T init)
{
  const unsigned long length = std::distance(first, last);

  if (!length) return init;

  const unsigned long min_per_thread = 25;
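  // Round up so that ranges shorter than min_per_thread still get one thread
  // (plain division would give 0 threads and a division by zero below).
  const unsigned long max_threads    = (length + min_per_thread - 1) / min_per_thread;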
  const unsigned long hardware_conc  = std::thread::hardware_concurrency();
  const unsigned long num_threads    = std::min(hardware_conc != 0 ? hardware_conc : 2, max_threads);
  const unsigned long block_size     = length / num_threads;

  std::vector<std::future<T>> futures(num_threads - 1);
  std::vector<std::thread> threads(num_threads - 1);

  Iterator block_start = first;
  for (unsigned long i = 0; i < (num_threads - 1); ++i)
  {
    Iterator block_end = block_start;
    std::advance(block_end, block_size);

    std::packaged_task<T(Iterator, Iterator)> task{accumulate_block<Iterator, T>()};
    futures[i] = task.get_future();
    threads[i] = std::thread(std::move(task), block_start, block_end);
    block_start = block_end;
  }

  T last_result = accumulate_block<Iterator, T>()(block_start, last);

  for (auto& t : threads) t.join();

  T result = init;
  for (unsigned long i = 0; i < (num_threads - 1); ++i) {
    result += futures[i].get();
  }
  result += last_result;
  return result;
}

template <typename TimeT = std::chrono::microseconds>
struct measure
{
  template <typename F, typename... Args>
  static typename TimeT::rep execution(F func, Args&&... args)
  {
    using namespace std::chrono;
    auto start = system_clock::now();
    func(std::forward<Args>(args)...);
    auto duration = duration_cast<TimeT>(system_clock::now() - start);
    return duration.count();
  }
};

template <typename T>
T parallel(const std::vector<T>& v)
{
  return parallel_accumulate(v.begin(), v.end(), 0);
}

template <typename T>
T stdaccumulate(const std::vector<T>& v)
{
  return std::accumulate(v.begin(), v.end(), 0);
}

int main()
{
  constexpr unsigned int COUNT = 200000000;
  std::vector<int> v(COUNT);

  // optional randomising vector contents - std::accumulate also gives 0us
  // but custom parallel accumulate gives longer times with randomised input
  std::mt19937 mersenne_engine;
  std::uniform_int_distribution<int> dist(1, 100);
  auto gen = std::bind(dist, mersenne_engine);
  std::generate(v.begin(), v.end(), gen);
  std::fill(v.begin(), v.end(), 1);

  auto v2 = v; // copy to work on the same data

  std::cout << "starting ... " << '\n';
  std::cout << "std::accumulate : \t" << measure<>::execution(stdaccumulate<int>, v) << "us" << '\n';
  std::cout << "parallel: \t" << measure<>::execution(parallel<int>, v2) << "us" << '\n';
}

The most interesting thing here is that I almost always get a measured time of 0 from std::accumulate.

Sample output:

starting ... 
std::accumulate :       0us
parallel: 
inside1 54us

inside2 81830us

inside3 89082us
89770us

What is wrong here?

http://cpp.sh/6jbt

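As usual with micro-benchmarking, you need to make sure that your code is actually doing something. You're calling accumulate, but you're not storing the result anywhere or doing anything with it. So does any of that work really need to be done? In the std::accumulate case the compiler simply removed all of that logic. That's why you get 0.

Just change your code so that the work actually has to be done. For example: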

int s, s2;
std::cout << "starting ... " << '\n';
std::cout << "std::accumulate : \t"
          << measure<>::execution([&]{s = std::accumulate(v.begin(), v.end(), 0);})
          << "us\n";
std::cout << "parallel: \t"
          << measure<>::execution([&]{s2 = parallel_accumulate(v2.begin(), v2.end(), 0);})
          << "us\n";
std::cout << s << ',' << s2 << std::endl;
