Marginal performance gain using Concurrency::parallel_for()

In my application, I have a for-loop running over roughly ten million items, like this:

int main(int argc, char* argv []) 
{
    unsigned int nNodes = 10000000;
    Node** nodeList = new Node* [nNodes];

    initialiseNodes(nodeList);  // nodes are initialised here

    for (unsigned int ii = 0; ii < nNodes; ++ii)
        nodeList[ii]->update();

    showOutput(nodeList);      // show the output in some way
}

I won't go into detail about how exactly the nodes are initialised or shown. What's important is that the Node::update() method is a small method, independent of the other nodes. Thus, it would be very advantageous to perform this for-loop in parallel. Since it is only a small thing, I wanted to stay away from OpenCL/CUDA/OpenMP this time, so I used the C++ Concurrency::parallel_for instead. The code then looks like this:

#include <ppl.h>

int main(int argc, char* argv []) 
{
    unsigned int nNodes = 10000000;
    Node** nodeList = new Node* [nNodes];

    initialiseNodes(nodeList);  // nodes are initialised here

    Concurrency::parallel_for(unsigned int(0), nNodes, [&](unsigned int ii) {
            nodeList[ii]->update();
    });

    showOutput(nodeList);      // show the output in some way
}

This does indeed speed up the programme a little, but typically only by 20% or so, I found. Frankly, I expected more. Can someone tell me if this is a typical speed-up factor when using parallel_for? Or are there ways to get more out of it (without switching to GPU implementations)?

Throwing more cores at a problem will not always yield an improvement. In fact, in the worst case it can even reduce performance. Whether you benefit from using multiple cores depends on many things, such as the amount of shared data involved. Some problems are inherently parallelizable, and some are not.
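
To make the shared-data point concrete, here is a rough sketch (the function names are hypothetical, just for illustration): if every iteration has to lock a shared accumulator, the loop is effectively serialised no matter how many cores you have, whereas keeping the data thread-local, for example with Concurrency::combinable, removes the shared state from the loop body.

#include <ppl.h>
#include <functional>
#include <mutex>

// Every iteration contends for the same lock and the same variable,
// so the work is effectively serialised despite parallel_for.
double sumSharedLocked(const double* values, unsigned int n)
{
    double total = 0.0;
    std::mutex m;
    Concurrency::parallel_for(0u, n, [&](unsigned int ii) {
        std::lock_guard<std::mutex> lock(m);   // contention point
        total += values[ii];
    });
    return total;
}

// Each thread accumulates into its own copy, so there is no shared
// state inside the loop body; the reduction happens once at the end.
double sumThreadLocal(const double* values, unsigned int n)
{
    Concurrency::combinable<double> partial;
    Concurrency::parallel_for(0u, n, [&](unsigned int ii) {
        partial.local() += values[ii];
    });
    return partial.combine(std::plus<double>());
}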

I found what I think contributes most heavily to the performance increase. Surely, like @anthony-burleigh said, the task has to be parallelisable, and the amount of shared data has an influence as well. What I found, however, is that the computational load of the parallelised method matters far more. Big tasks seem to give a higher speed-up than small tasks.

So for example, in:

Concurrency::parallel_for(unsigned int(0), nNodes, [&](unsigned int ii) {
        nodeList[ii]->update();  // <-- very small task
});

I only got a speed-up factor of 1.2. However, in a heavy task, like:

Concurrency::parallel_for(unsigned int(0), nNodes, [&](unsigned int ii) {
        ray[ii]->recursiveRayTrace();  // <-- very heavy task
});

the programme suddenly ran 3 times as fast.

I am sure that there is a deeper explanation for all this, but this is what I found by trial and error.
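
One way to act on this without changing the algorithm is to give each parallel task a block of nodes instead of a single one, so the per-task work outweighs the scheduling overhead. A minimal sketch, assuming the Node type from the question (the chunk size of 1024 is an arbitrary choice to tune for your workload):

#include <ppl.h>
#include <algorithm>

// Update the nodes in blocks so that each parallel task carries
// enough work to outweigh the cost of scheduling it.
void updateNodesChunked(Node** nodeList, unsigned int nNodes)
{
    const unsigned int chunkSize = 1024;
    const unsigned int nChunks = (nNodes + chunkSize - 1) / chunkSize;

    Concurrency::parallel_for(0u, nChunks, [&](unsigned int chunk) {
        const unsigned int begin = chunk * chunkSize;
        const unsigned int end = std::min(begin + chunkSize, nNodes);
        for (unsigned int ii = begin; ii < end; ++ii)
            nodeList[ii]->update();
    });
}

If your version of PPL supports partitioners, passing something like Concurrency::static_partitioner to parallel_for is another way to cut the scheduling overhead when every iteration does a similar amount of work.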
