Tasks run in threads take longer than in serial?
So I'm doing some computation on 4 million nodes.

The very basic serial version just has a for loop that iterates 4 million times and does the computation each time. This takes roughly 1.2 sec.

When I split that for loop into, say, 4 for loops, each doing 1/4 of the computation, the total time becomes 1.9 sec. I guess there is some overhead in setting up the loops, and maybe it has to do with the CPU preferring to work on data in contiguous chunks.

The thing that really bothers me: when I put the 4 loops on 4 threads on an 8-core machine, each thread takes 0.9 seconds to finish. I was expecting each of them to take only about 1.9/4 seconds.

I don't think there is any race condition or synchronization issue, since all I do is use one for loop to create the 4 threads (which took 200 microseconds) and another for loop to join them.

The computation reads from one shared array and writes to a different shared array. I am sure the threads are not writing to the same bytes.

Where could the overhead come from?
main: ncores is the number of cores; node_size is the size of the graph (4 million nodes).
/* Spawn one worker per core; each gets its own heap-allocated index. */
for (i = 0; i < ncores; i++) {
    int *t = malloc(sizeof(int));
    *t = i;
    pthread_create(&thread[i], NULL, calculate_rank_pthread, t);
}
/* Wait for all workers to finish. */
for (i = 0; i < ncores; i++)
    pthread_join(thread[i], NULL);
calculate_rank_pthread: vector is the rank vector for the PageRank calculation.
void *calculate_rank_pthread(void *argument) {
    int i;
    int index = *(int *)argument;
    free(argument);
    /* Interleaved partitioning: thread `index` handles rows
       index, index + ncores, index + 2*ncores, ... */
    for (i = index; i < node_size; i += ncores)
        current_vector[i] = calc_r(i, vector);
    return NULL;
}
calc_r: this is just a PageRank calculation for row i, using the compressed sparse row (CSR) format.
double calc_r(int i, double *vector) {
    double prank = 0;
    int j;
    /* Dot product of sparse row i with the rank vector. */
    for (j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
        prank += vector[col_ind[j]] * val[j];
    }
    return prank;
}
Everything that is not declared above is a global variable.
The computation reads from one shared array and writes to a different shared array. I am sure the threads are not writing to the same bytes.
It's impossible to be sure without seeing the relevant code and having some more details, but this sounds like it could be due to false sharing, or ...

the performance issue of false sharing (aka cache-line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership, so that it can physically perform an update to the cache line, will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206
UPDATE
This looks like it could very well trigger false sharing, depending on the size of a vector (though there is still not enough information in the post to be sure, as we don't see how the various vectors are allocated):

for(i = index; i < node_size ; i+=ncores)

Instead of interleaving which core works on which data with i += ncores, give each thread a contiguous range of data to work on. That way each thread writes to its own region of current_vector, and elements written by different threads only meet on the same cache line at the range boundaries.
I got the same surprise when building and running in Debug (with different test code, though). In Release, everything was as expected ;)