Pthread program runs slower as the number of threads increases

Question

I'm a beginner in parallel programming and I tried to write a parallel program with the pthread library. I ran the program on an 8-processor computer. The problem is that when I increase NumProcs, each thread slows down even though each thread's task is always the same. Can someone help me figure out what is happening?

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX_NUMP 16
using namespace std;
int NumProcs;

pthread_mutex_t   SyncLock; /* mutex */
pthread_cond_t    SyncCV; /* condition variable */
int               SyncCount; /* number of processors at the barrier so far */

pthread_mutex_t   ThreadLock; /* mutex */

// used only in solaris. use clock_gettime in linux
//hrtime_t          StartTime;
//hrtime_t          EndTime;  

struct timespec StartTime;
struct timespec EndTime;

void Barrier()
{
  int ret;

  pthread_mutex_lock(&SyncLock); /* Get the thread lock */
  SyncCount++;
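  if(SyncCount == NumProcs) {
    ret = pthread_cond_broadcast(&SyncCV);
    assert(ret == 0);
  } else {
    /* re-check the predicate in a loop to guard against spurious wakeups */
    while(SyncCount < NumProcs) {
      ret = pthread_cond_wait(&SyncCV, &SyncLock);
      assert(ret == 0);
    }
  }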
  pthread_mutex_unlock(&SyncLock);
}


/* The function which is called once the thread is allocated */
void* ThreadLoop(void* tmp)
{
  /* each thread has a private version of local variables */
  long threadId = (long) tmp; 
  clock_t startTime, endTime; /* clock() returns clock_t, not int */
  int count=0;
  /* ********************** Thread Synchronization*********************** */
  Barrier();

  /* ********************** Execute Job ********************************* */
  startTime = clock();
  for(int i=0;i<65536;i++)
    for(int j=0;j<1024;j++)
        count++;
  endTime = clock();
  printf("threadid:%ld, time:%ld\n",threadId,(long)(endTime-startTime));
  return NULL; /* a void* start routine must return a value */
}


int main(int argc, char** argv)
{
  pthread_t*     threads;
  pthread_attr_t attr;
  int            ret;
  int            dx;

  if(argc != 2) {
    fprintf(stderr, "USAGE: %s <numProcessors>\n", argv[0]);
    exit(-1);
  }
  assert(argc == 2);
  NumProcs = atoi(argv[1]);
  assert(NumProcs > 0 && NumProcs <= MAX_NUMP);

  /* Initialize array of thread structures */
  threads = (pthread_t *) malloc(sizeof(pthread_t) * NumProcs);
  assert(threads != NULL);

  /* Initialize thread attribute */
  pthread_attr_init(&attr);
  pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM); // sys manages contention

  /* Initialize mutexes */
  ret = pthread_mutex_init(&SyncLock, NULL);
  assert(ret == 0);
  ret = pthread_mutex_init(&ThreadLock, NULL);
  assert(ret == 0);

  /* Init condition variable */
  ret = pthread_cond_init(&SyncCV, NULL);
  assert(ret == 0);
  SyncCount = 0;


  /* get high resolution timer, timer is expressed in nanoseconds, relative
   * to some arbitrary time.. so to get delta time must call gethrtime at
   * the end of operation and subtract the two times.
   */
  //StartTime = gethrtime();
  ret = clock_gettime(CLOCK_MONOTONIC, &StartTime);

  for(dx=0; dx < NumProcs; dx++) {
    /* ************************************************************
     * pthread_create takes 4 parameters
     *  p1: threads(output)
     *  p2: thread attribute
     *  p3: start routine, where new thread begins
     *  p4: arguments to the thread
     * ************************************************************ */
    ret = pthread_create(&threads[dx], &attr, ThreadLoop, (void*)(long) dx);
    assert(ret == 0);

  }

  /* Wait for each of the threads to terminate */
  for(dx=0; dx < NumProcs; dx++) {
    ret = pthread_join(threads[dx], NULL);
    assert(ret == 0);
  }

  //EndTime = gethrtime();
  ret = clock_gettime(CLOCK_MONOTONIC, &EndTime);

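  /* include tv_sec: tv_nsec alone wraps around at every second boundary */
  printf("Time = %lld nanoseconds\n",
         (long long)(EndTime.tv_sec - StartTime.tv_sec) * 1000000000LL
         + (long long)(EndTime.tv_nsec - StartTime.tv_nsec));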

  pthread_mutex_destroy(&ThreadLock);

  pthread_mutex_destroy(&SyncLock);
  pthread_cond_destroy(&SyncCV);
  pthread_attr_destroy(&attr);

  return 0;
}

Answer 1

Your observation is expected.

The main factors that usually affect this situation (workers spinning on local computation) are:

The ratio nb_threads / nb_available_machine_cores
The affinity of each thread

The optimal scenario here is a ratio of 1, with each thread bound to its own core.

The idea is to maximize the throughput of each core. You can do that by having one and only one thread running on each core. If you increase the number of threads (ratio > 1), several threads will share the same core, forcing the kernel (through the task scheduler) to switch between them. This is what you were observing.

Each time the kernel has to perform such a switch, you pay for a context switch. This can become a noticeable overhead.
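To make the ratio concrete, here is a minimal sketch that compares a requested thread count with the number of cores the OS reports as online. The helper names online_cores and oversubscription are made up for illustration, and _SC_NPROCESSORS_ONLN is a widely available extension (Linux, Solaris, macOS) rather than something POSIX guarantees.

```cpp
#include <cassert>
#include <unistd.h>   // sysconf, _SC_NPROCESSORS_ONLN (assumed to be available)

// Number of cores currently online as reported by the OS; falls back to 1
// if the query fails. Hypothetical helper, not part of the question's code.
long online_cores() {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

// Ratio of worker threads to online cores. A value above 1.0 means that
// at least two threads have to share a core.
double oversubscription(int num_threads) {
    assert(num_threads > 0);
    return static_cast<double>(num_threads) / static_cast<double>(online_cores());
}
```

In the question's program, oversubscription(NumProcs) would exceed 1.0 on the 8-processor machine once NumProcs goes above 8, which is the range in which the answer expects context-switch overhead to appear.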

Note:

You can use pthread_setaffinity_np (a non-portable glibc extension on Linux) to set the affinity of your threads.
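As a rough sketch of what setting the affinity might look like on Linux with glibc, the helpers below pin the calling thread to one core. The names pin_self_to_core and first_allowed_core are hypothetical, and the code assumes the GNU extensions pthread_setaffinity_np and pthread_getaffinity_np are available; other platforms use different APIs.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for the *_np calls; g++ usually defines it already
#endif
#include <cassert>
#include <pthread.h>
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET, CPU_ISSET

// Pin the calling thread to a single core.
// Returns 0 on success or an error number on failure.
int pin_self_to_core(int core) {
    assert(core >= 0 && core < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// First core the calling thread is currently allowed to run on, or -1 if
// the affinity mask cannot be read. Useful when the process is restricted
// to a subset of cores (for example inside a container).
int first_allowed_core() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        return -1;
    for (int c = 0; c < CPU_SETSIZE; ++c)
        if (CPU_ISSET(c, &set))
            return c;
    return -1;
}
```

One possible use in the question's program would be to call pin_self_to_core with a core derived from threadId at the top of ThreadLoop, so that each thread stays on its own core while the ratio is at most 1.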

Answer 2

If you are running this in release mode (-O3 compiler flag), then there are two things wrong with ThreadLoop():

1) The 'count' result is never used externally, so the compiler is free to omit the computation because it has no visible effect.

2) Even if 'count' were used externally, the compiler could compute the result at compile time and simply emit the value directly.

You can see all this if you disassemble the binary.

You can declare 'volatile int count' to work around both problems, compile with the -O1 compiler flag, or do both.
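A minimal sketch of the volatile variant follows. The function name busy_loop and its parameters are made up for illustration; the loop body mirrors the one in ThreadLoop.

```cpp
#include <cassert>

// Same nested loop as in ThreadLoop, but the counter is volatile, so every
// increment is a real load and store that the optimizer may neither remove
// nor replace with a precomputed constant.
int busy_loop(int outer, int inner) {
    assert(outer >= 0 && inner >= 0);
    volatile int count = 0;
    for (int i = 0; i < outer; i++)
        for (int j = 0; j < inner; j++)
            count = count + 1;   // written out because ++ on a volatile is deprecated in C++20
    return count;
}
```

Returning the counter also gives the result an external use, which addresses the first point above independently of the volatile qualifier.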

The loop should scale roughly linearly with the number of threads because there is no memory contention. By the way, you should increase the number of loop iterations, because I think the duration could be close to the noise level.
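To check the scaling with less noise, one option is to time a longer loop with a per-thread clock. The sketch below is only an illustration: the names elapsed_seconds and time_busy_loop are hypothetical, and it assumes a system that supports CLOCK_THREAD_CPUTIME_ID. One point the answers do not discuss, and that may be worth verifying on your system: with glibc on Linux, clock() reports CPU time for the whole process, so a per-thread reading taken with clock() can include time consumed by the other threads.

```cpp
#include <cassert>
#include <time.h>

// Seconds between two timespec readings, using both tv_sec and tv_nsec.
double elapsed_seconds(const timespec& start, const timespec& end) {
    return static_cast<double>(end.tv_sec - start.tv_sec)
         + static_cast<double>(end.tv_nsec - start.tv_nsec) / 1e9;
}

// CPU time, in seconds, that the calling thread alone spends performing
// `iterations` increments of a volatile counter.
double time_busy_loop(long long iterations) {
    timespec start, end;
    volatile long long count = 0;
    int ret = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
    assert(ret == 0);
    for (long long i = 0; i < iterations; i++)
        count = count + 1;
    ret = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
    assert(ret == 0);
    (void)ret;   // silence the unused-variable warning when NDEBUG is defined
    return elapsed_seconds(start, end);
}
```

Calling time_busy_loop from each thread with a large iteration count, and printing the result in place of the clock() difference, would show whether the per-thread cost really grows with NumProcs.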

Answer 1
1 2014-05-30 20:45:18

Answer 2
0 2014-05-30 22:27:16
