Parallel execution takes more time than serial execution?

Question

I am studying task implementation in TBB and have run code for the parallel and serial calculation of the Fibonacci series.

The code is:

#include <iostream>
#include <list>
#include <tbb/task.h>
#include <tbb/task_group.h>
#include <stdlib.h>
#include "tbb/compat/thread"
#include "tbb/task_scheduler_init.h"
#include "tbb/tick_count.h"
using namespace std;
using namespace tbb;

#define CutOff 2

long serialFib( long n ) {
    if( n<2 )
        return n;
    else
        return serialFib(n-1) + serialFib(n-2);
}


class FibTask: public task 
{
    public:
    const long n;
    long* const sum;

    FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}

    task* execute()  // Overrides virtual function task::execute
    {
        // cout<<"task id of thread is \t"<<this_thread::get_id()<<"FibTask(n)="<<n<<endl;
        // cout<<"Task Stolen is"<<is_stolen_task()<<endl;
        if( n<CutOff )
        {
            *sum = serialFib(n);
        }
        else
        {
            long x, y;
            FibTask& a = *new( allocate_child() ) FibTask(n-1,&x);
            FibTask& b = *new( allocate_child() ) FibTask(n-2,&y);
            // 3 = 2 children + 1 for the wait. ref_count is used to keep track of
            // the number of tasks spawned at the current level of the task graph.
            set_ref_count(3);
            spawn( b );
            // cout<<"child id of thread is \t"<<this_thread::get_id()<<"calculating n ="<<n<<endl;
            spawn_and_wait_for_all( a ); // set tasks for execution and wait for them
            *sum = x+y;
        }
        return NULL;
    }
};


long parallelFib( long n ) 
{
    long sum;
    FibTask& a = *new(task::allocate_root()) FibTask(n,&sum);
    task::spawn_root_and_wait(a);
    return sum;
}


int main()
{
    long i,j;
    cout<<fixed;

    cout<<"Fibonacci Series parallelly formed is "<<endl;
    tick_count t0=tick_count::now();
    for(i=0;i<50;i++)
        cout<<parallelFib(i)<<"\t";
    // cout<<"parallel execution of Fibonacci series for n=10 \t"<<parallelFib(i)<<endl;

    tick_count t1=tick_count::now();
    double t=(t1-t0).seconds();
    cout<<"Time Elapsed in Parallel Execution is  \t"<<t<<endl;
    cout<<"\n Fibonacci Series Serially formed is "<<endl;
    tick_count t3=tick_count::now();

    for(j=0;j<50;j++)
        cout<<serialFib(j)<<"\t";
    tick_count t4=tick_count::now();
    double t5=(t4-t3).seconds();
    cout<<"Time Elapsed in Serial  Execution is  \t"<<t5<<endl;
    return(0);
}

Parallel execution is taking more time than serial execution. In this run, parallel execution took 2500 sec, whereas serial execution took around 167 sec. Can anybody please explain the reason for this?

Answer 1

Overhead.

When your actual task is lightweight, the coordination/communication dominates and you do not (automatically) gain from parallel execution. This is a pretty common issue.

Try instead to compute M Fibonacci numbers (of a high enough cost) serially, then compute them in parallel. You should see a gain.
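The suggestion above could be sketched roughly as follows. This is not TBB code: it swaps in std::async from the C++ standard library so that the example is self-contained, and fibRangeParallel is a hypothetical helper name, not something from the question. Each task computes one whole Fibonacci number serially, so the amount of work per task is large compared with the cost of starting it.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Same exponential-time recursion as serialFib in the question.
long long serialFib(int n) {
    return n < 2 ? n : serialFib(n - 1) + serialFib(n - 2);
}

// Hypothetical helper (an assumption, not part of the question's code):
// computes fib(first), ..., fib(first + count - 1), one std::async task
// per number. Every task is one large block of serial work.
std::vector<long long> fibRangeParallel(int first, int count) {
    assert(first >= 0 && count >= 0);
    std::vector<std::future<long long>> pending;
    pending.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        pending.push_back(std::async(std::launch::async, serialFib, first + i));
    }
    std::vector<long long> results;
    results.reserve(pending.size());
    for (auto& f : pending) {
        results.push_back(f.get());  // blocks until that task has finished
    }
    return results;
}
```

Timing fibRangeParallel(35, 4) against four plain calls to serialFib should show whether the machine gains anything; the actual numbers depend on the core count and compiler flags.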

Answer 2

Change CutOff to 12, compile with optimization on (-O on Linux; /O2 on Windows), and you should see significant speedup.

There is plenty of parallelism in the example. The problem is that with CutOff=2, the individual units of useful parallel computation are swamped by scheduling overhead. Raising the CutOff value should resolve the problem.
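A minimal sketch of the cutoff pattern, again using std::async in place of TBB tasks so that it compiles with only the standard library. That substitution is an assumption of this example: a std::launch::async task runs as if on a new thread, which is far heavier than a TBB task, so a sensible cutoff here is much higher than the 12 suggested for TBB.

```cpp
#include <cassert>
#include <future>

long long serialFib(int n) {
    return n < 2 ? n : serialFib(n - 1) + serialFib(n - 2);
}

// Recurse in parallel only while n >= cutoff; below that, the serial code
// takes over. The cutoff is a parameter here rather than a #define.
long long parallelFib(int n, int cutoff) {
    assert(n >= 0 && cutoff >= 2);
    if (n < cutoff) {
        return serialFib(n);
    }
    // One branch runs as a separate task, the other in the current thread.
    std::future<long long> x =
        std::async(std::launch::async, parallelFib, n - 1, cutoff);
    long long y = parallelFib(n - 2, cutoff);
    return x.get() + y;
}
```

For example, parallelFib(30, 25) starts only a handful of tasks, each of which does a substantial amount of serial work. Lowering the cutoff toward 2 in this sketch would create a very large number of threads and is not recommended.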

Here is the analysis. There are two important times for analyzing parallelism:

work - the total amount of computational work.
span - the length of the critical path.

The available parallelism is work/span.

For fib(n), when n is sufficiently large, the work is roughly proportional to fib(n) [yes, it describes itself!]. The span is the depth of the call tree; it is roughly proportional to n. So the parallelism is proportional to fib(n)/n. So even for n=10, there is plenty of available parallelism to keep a typical 2013 desktop machine humming.
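To make the work/span estimate concrete, the sketch below counts both quantities for the recursion tree of fib(n). It assumes a simplified cost model in which every call costs one unit; fibWork and fibSpan are illustrative names, not TBB functions.

```cpp
#include <algorithm>
#include <cassert>

// Work: total number of calls in the recursion tree of fib(n),
// assuming each call costs one unit.
long long fibWork(int n) {
    assert(n >= 0);
    return n < 2 ? 1 : 1 + fibWork(n - 1) + fibWork(n - 2);
}

// Span: length of the longest chain of dependent calls, i.e. the depth
// of the recursion tree.
int fibSpan(int n) {
    assert(n >= 0);
    return n < 2 ? 1 : 1 + std::max(fibSpan(n - 1), fibSpan(n - 2));
}
```

Under this model fibWork(10) is 177 and fibSpan(10) is 10, so the available parallelism for n=10 is about 17.7, which is more than the core count of a typical desktop machine.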

The problem is that TBB tasks take time to create, execute, synchronize, and destroy. Changing CutOff from 2 to 12 allows the serial code to take over when the work is so small that scheduling overheads would swamp it. This is a common pattern in recursive parallelism: recurse in parallel until you are down to chunks of work that might as well be done serially. Other parallel frameworks (like OpenMP or Cilk Plus) have the same issue: there is overhead for tasks, though it may be more or less than TBB's. All that changes is what the best threshold value is.

Try varying CutOff. Lower values should give you more parallelism but more scheduling overhead. Higher values give you less parallelism but less scheduling overhead. In between, you will likely find a range of values that give good speedup.
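One way to see the trade-off without running TBB at all is to count how many tasks the FibTask recursion from the question would create for a given cutoff. The function below is a hypothetical counting helper that mirrors that recursion (one task per node, two children whenever n is not below the cutoff); it measures task count only, not time.

```cpp
#include <cassert>

// Number of tasks created for fib(n) with the given cutoff, following the
// structure of FibTask::execute in the question: a task with n < cutoff is
// a leaf that runs serialFib, any other task spawns two children.
long long countTasks(int n, int cutoff) {
    assert(n >= 0 && cutoff >= 2);
    if (n < cutoff) {
        return 1;  // leaf task
    }
    return 1 + countTasks(n - 1, cutoff) + countTasks(n - 2, cutoff);
}
```

For n=30, countTasks(30, 2) is 2692537 while countTasks(30, 12) is 21891, so raising the cutoff from 2 to 12 removes roughly 99% of the task creations while still leaving thousands of tasks to spread across cores.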

Answer 3

Without more information it will be hard to tell. You need to check: how many processors does your computer have? Were there any other programs running that might have been using those processors? If you want to run in (true) parallel and gain performance benefits, then the operating system must be able to allocate at least 2 free processors. Also, for small tasks, the overhead of allocating threads and collecting their results might exceed the benefits of parallel execution.
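As a quick check of the first point, the standard library can report how many hardware threads are visible to the program. This is a sketch using std::thread::hardware_concurrency rather than any TBB facility, and usableThreads is a hypothetical helper name. The value is only a hint and says nothing about how busy those processors currently are.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// hardware_concurrency() may return 0 when the value cannot be determined,
// so fall back to 1 in that case.
unsigned usableThreads() {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    assert(n >= 1u);
    return n;
}
```

If usableThreads() returns 1, no parallel version can be expected to beat the serial one on that machine.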

Answer 4

Am I right in thinking that each task computes the result of fib(n-1) plus the result of fib(n-2)? So essentially, you start a task, which then starts another task, and so on, until we have a very large number of tasks (I got slightly lost trying to count them all; I think it's n squared). And the result of each such task is used to add up the Fibonacci number.

First of all, there is no actual parallel execution here (other than perhaps two independent recursive calculations). Every task relies on the result of its subtask, and can't really do anything in parallel. On the other hand, you are performing a whole lot of work to set up each task. It is not at all surprising that you don't see any benefit.

Now, if you were to calculate the Fibonacci numbers 1 .. 50 by iteration, and you started, say, one task per processor core in your system, and compared that to an iterative solution using just a single loop, I'm sure that would show a much better improvement.
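For reference, the iterative calculation mentioned above might look like the sketch below. It covers only the single-loop part, not the one-task-per-core split. The name iterativeFib and the use of std::int64_t are choices made for this example: a 32-bit long, as on Windows, cannot hold the larger values up to fib(49) that the question's loop prints.

```cpp
#include <cassert>
#include <cstdint>

// Iterative Fibonacci: n additions instead of an exponential number of calls.
std::int64_t iterativeFib(int n) {
    // The upper bound keeps every intermediate value inside the int64 range.
    assert(n >= 0 && n <= 90);
    std::int64_t prev = 0;  // fib(i)
    std::int64_t curr = 1;  // fib(i + 1)
    for (int i = 0; i < n; ++i) {
        std::int64_t next = prev + curr;
        prev = curr;
        curr = next;
    }
    return prev;
}
```

With this version each number costs so little that the whole series up to fib(49) is computed almost instantly, which is a very different workload from the recursive one timed in the question.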

4 answers

Answer 1: 6, accepted, 2013-03-14 14:31:29
Answer 2: 2, 2013-03-15 01:35:39
Answer 3: 0, 2013-03-14 14:32:54
Answer 4: 0, 2013-03-14 14:35:05
