
Tasks run in threads take longer than in serial?

So I'm doing some computation on 4 million nodes.

The very basic serial version just has a for loop that iterates 4 million times and does the computation 4 million times. This takes roughly 1.2 seconds.

When I split the for loop into, say, 4 for loops, each doing 1/4 of the computation, the total time became 1.9 seconds.

I guess there is some overhead in creating for loops, and maybe it has to do with the CPU preferring to compute data in chunks.

What really bothers me is that when I try to put the 4 loops on 4 threads on an 8-core machine, each thread takes 0.9 seconds to finish. I was expecting each of them to only take 1.9/4 seconds instead.

I don't think there is any race condition or synchronization issue, since all I did was use a for loop to create the 4 threads, which took 200 microseconds, and then a for loop to join them.

The computation reads from a shared array and writes to a different shared array. I am sure they are not writing to the same byte.

Where could the overhead come from?

main: ncores is the number of cores; node_size is the size of the graph (4 million nodes).

        for(i = 0; i < ncores; i++){
            int *t = malloc(sizeof(int));   /* freed by the thread once it has read its index */
            *t = i;
            int iret = pthread_create(&thread[i], NULL, calculate_rank_p, (void*)t);
            if (iret != 0)
                fprintf(stderr, "pthread_create failed: %d\n", iret);
        }
        for (i = 0; i < ncores; i++)
        {
            pthread_join(thread[i], NULL);
        }

calculate_rank_p: vector is the rank vector for the PageRank calculation.

void *calculate_rank_p(void *argument) {
    int i;  /* must be local: a shared global i would race between threads */
    int index = *(int*)argument;
    free(argument);
    for(i = index; i < node_size; i += ncores)
        current_vector[i] = calc_r(i, vector);
    return NULL;
}

calc_r: this is just a PageRank calculation using compressed row format.

double calc_r(int i, double *vector){
    double prank = 0;
    int j;
    for(j = row_ptr[i]; j < row_ptr[i+1]; j++){
        prank += vector[col_ind[j]] * val[j];
    }
    return prank;
}

Everything that is not declared locally is a global variable.

The computation reads from a shared array and writes to a different shared array. I am sure they are not writing to the same byte.

It's impossible to be sure without seeing the relevant code and having some more details, but this sounds like it could be due to false sharing, or ...

the performance issue of false sharing (aka cache-line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership, so that it can physically perform an update to the cache line, will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.

http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206

UPDATE

This looks like it could very well trigger false sharing, depending on the size of a vector (though there is still not enough information in the post to be sure, as we don't see how the various vectors are allocated):

for(i = index; i < node_size ; i+=ncores) 

Instead of interleaving which core works on which data with i += ncores, give each thread a contiguous range of data to work on. With the strided pattern, neighbouring elements of current_vector are written by different threads, so writes from all four threads constantly land on the same cache lines.

For me, the same surprise occurred when building and running in Debug (with other test code, though).

In Release, everything behaved as expected. ;)

