On Linux, GCC / pthread parallel code is much slower than simple single-threaded code

Question

I am testing pthread parallel code on Linux with gcc (GCC) 4.8.3 20140911, on a CentOS 7 server.

The single-threaded version is simple; it initializes a 10000 * 10000 matrix:

#include <stdlib.h>

int main(void)
{
    int size = 10000;

    int *r = (int*)malloc(size * size * sizeof(int));
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            r[i * size + j] = rand();
        }
    }
    free(r);
    return 0;
}

Then I wanted to see if parallel code can improve the performance:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int size = 10000;

void *SetOdd(void *param) 
{
   printf("Enter odd\n"); 
   int * r      = (int*)param;
   for (int i=0; i<size; i+=2) {
         for (int j=0; j<size; j++) {
                r[i * size + j] = rand();
         }
   }
   printf("Exit Odd\n");
   pthread_exit(NULL);
   return 0;
} 

void *SetEven(void *param) 
{ 
   printf("Enter Even\n");
   int * r      = (int*)param;
   for (int i=1; i<size; i+=2) {
        for (int j=0; j<size; j++) {
                r[i * size + j] = rand();
        }
   }
   printf("Exit Even\n");
   pthread_exit(NULL);
   return 0;
} 

int main(void)
{
     printf("running in thread\n");
     pthread_t threads[2];
     int * r = (int*)malloc(size * size * sizeof(int));
     int rc0 = pthread_create(&threads[0], NULL, SetOdd, (void *)r);
     int rc1 = pthread_create(&threads[1], NULL, SetEven, (void *)r);
     if (rc0 || rc1) {
         printf("ERROR; return code from pthread_create() is %d\n", rc0 ? rc0 : rc1);
         exit(-1);
     }
     for (int t=0; t<2; t++) {
         void* status;
         int rc = pthread_join(threads[t], &status);
         if (rc) {
             printf("ERROR; return code from pthread_join() is %d\n", rc);
             exit(-1);
         }
         printf("Completed join with thread %d status= %ld\n", t, (long)status);
     }

     free(r);
     return 0;
}

The simple code runs for about 0.8 seconds, while the multithreaded version runs for about 10 seconds!

I am running on a 4-core server. Why is the multithreaded version so slow?

Answer 1

rand() is not guaranteed to be thread-safe or re-entrant, so you shouldn't rely on it in multi-threaded applications.

Use rand_r() instead. It is also a pseudo-random generator, and it is thread-safe because the caller supplies the state. Using rand_r() results in a shorter execution time for your code on my system with 2 cores (roughly half the time of the single-threaded version).

In both of your thread functions, do:

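/* Needs #include <time.h> for time(). */
void *SetOdd(void *param)
{
   printf("Enter odd\n");
   /* Private generator state for this thread; rand_r() updates it in place.
      If both threads read the same time(0), they produce the same sequence. */
   unsigned int s = (unsigned int)time(0);

   int * r      = (int*)param;
   for (int i=0; i<size; i+=2) {
         for (int j=0; j<size; j++) {
                r[i * size + j] = rand_r(&s);
         }
   }
   printf("Exit Odd\n");
   pthread_exit(NULL);
   return 0;   /* not reached; pthread_exit() does not return */
}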
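The same idea can be written once for any number of threads. The following is only a sketch under some assumptions: the names fill_job, fill_rows, fill_matrix_parallel, MAX_THREADS and base_seed are made up for illustration, and each thread gets a distinct seed (base_seed plus its index) so the threads do not all generate the same sequence. Each worker copies its seed into a local variable before the loop so that the rand_r() state is not shared between threads.

```c
#define _POSIX_C_SOURCE 200809L   /* makes rand_r() visible under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 16            /* arbitrary cap chosen for this sketch */

/* Hypothetical per-thread argument: which rows to fill and with which seed. */
struct fill_job {
    int *matrix;                  /* shared size x size matrix */
    int size;                     /* number of rows (and columns) */
    int first_row;                /* first row owned by this thread */
    int stride;                   /* distance between owned rows (= thread count) */
    unsigned int seed;            /* initial rand_r() state for this thread */
};

static void *fill_rows(void *arg)
{
    const struct fill_job *job = arg;
    unsigned int seed = job->seed;    /* thread-private generator state */

    for (int i = job->first_row; i < job->size; i += job->stride)
        for (int j = 0; j < job->size; j++)
            job->matrix[(size_t)i * job->size + j] = rand_r(&seed);
    return NULL;
}

/* Fills the matrix with nthreads workers.
   Returns 0 on success, otherwise the first pthread error code seen. */
int fill_matrix_parallel(int *matrix, int size, int nthreads, unsigned int base_seed)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    pthread_t threads[MAX_THREADS];
    struct fill_job jobs[MAX_THREADS];
    int created = 0;
    int rc = 0;

    for (; created < nthreads; created++) {
        jobs[created] = (struct fill_job){ matrix, size, created, nthreads,
                                           base_seed + (unsigned int)created };
        rc = pthread_create(&threads[created], NULL, fill_rows, &jobs[created]);
        if (rc != 0)
            break;                /* join whatever was started, then report */
    }
    for (int t = 0; t < created; t++) {
        int join_rc = pthread_join(threads[t], NULL);
        if (rc == 0)
            rc = join_rc;
    }
    return rc;
}
```

A caller would allocate the matrix as in the question and call something like fill_matrix_parallel(r, 10000, 4, (unsigned int)time(0)), compiling with -pthread.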

Update:

While the C and POSIX standards do not mandate that rand() be thread-safe, the glibc implementation (used on Linux) does implement it in a thread-safe manner.

If we look at the glibc implementation of rand(), there's a lock:

  __libc_lock_lock (lock);

  (void) __random_r (&unsafe_state, &retval);

  __libc_lock_unlock (lock);

Any synchronization construct (mutex, condition variable, etc.) costs performance, so the fewer such constructs a hot path uses, the better (of course, we can't avoid them completely in multi-threaded applications).

Here, only one thread at a time can access the random number generator, and both threads are fighting for the lock all the time. This explains why rand() leads to poor performance in multi-threaded code.
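To make the difference concrete, here is a minimal sketch of the two designs. It is not glibc's code: the generator is a simple linear congruential step using the constants from the sample rand() in the C standard, and the names lcg_step, locked_rand, private_rand, gen_lock and shared_state are invented for illustration. locked_rand() mirrors the structure above (one global state behind one mutex), while private_rand() mirrors rand_r() (state owned by the caller, no lock).

```c
#include <pthread.h>

/* One global state guarded by one mutex, as in the glibc excerpt above. */
static pthread_mutex_t gen_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int shared_state = 1;

/* Illustrative generator step; NOT the algorithm glibc uses. */
static int lcg_step(unsigned int *state)
{
    *state = *state * 1103515245u + 12345u;
    return (int)((*state / 65536u) % 32768u);
}

/* rand()-style: every caller in every thread serializes on gen_lock. */
int locked_rand(void)
{
    pthread_mutex_lock(&gen_lock);
    int value = lcg_step(&shared_state);
    pthread_mutex_unlock(&gen_lock);
    return value;
}

/* rand_r()-style: the caller owns the state, so no lock is needed
   as long as each thread passes its own variable. */
int private_rand(unsigned int *state)
{
    return lcg_step(state);
}
```

Two threads calling locked_rand() in a tight loop spend much of their time handing the mutex back and forth, whereas two threads calling private_rand() on their own state never touch shared data.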

Answer 2

The rand() function is designed to produce a predictable sequence of random numbers (the seed of the sequence can be controlled with the srand() function). That implies that the function has internal state, in all likelihood protected by a mutex.

The presence of the lock can be confirmed with tools such as gprof or valgrind --tool=callgrind. (For gprof to detect problems related to the standard library, you would need to compile/link the application with -static.)

In single-threaded mode, the mutex is effectively inactive because nothing contends for it. But in multi-threaded mode, both threads fight to acquire the same lock in a tight loop, so they collide and stall constantly. That severely degrades multi-threaded performance.
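Besides a profiler, a rough timing harness can also make the effect visible. The sketch below is an assumption-laden illustration, not a rigorous benchmark: time_rand_calls, call_rand and MAX_THREADS are hypothetical names, and the numbers will vary with the machine and the C library.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define MAX_THREADS 16            /* arbitrary cap chosen for this sketch */

/* Worker: call rand() the requested number of times and discard the result. */
static void *call_rand(void *arg)
{
    long calls = *(const long *)arg;
    for (long k = 0; k < calls; k++)
        (void)rand();
    return NULL;
}

/* Wall-clock seconds for nthreads threads to make `calls` rand() calls each.
   Returns a negative value if a thread could not be created. */
double time_rand_calls(int nthreads, long calls)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    pthread_t threads[MAX_THREADS];
    struct timespec start, end;
    int created = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    while (created < nthreads &&
           pthread_create(&threads[created], NULL, call_rand, &calls) == 0)
        created++;
    for (int t = 0; t < created; t++)
        pthread_join(threads[t], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (created < nthreads)
        return -1.0;
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Comparing time_rand_calls(1, 2 * n) with time_rand_calls(2, n) keeps the total number of rand() calls the same; if the lock is the bottleneck, the two-thread run should not be faster and may be considerably slower.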

Answer 1
5 2015-12-02 10:37:43

Answer 2
3 2015-12-02 10:45:18
