How can multithreading give a speedup factor greater than the number of cores?

Question

I am using pthreads with gcc. The simple code example takes the number of threads "N" as a user-supplied input. It splits up a long array into N roughly equally sized subblocks. Each subblock is written into by an individual thread.

The dummy processing for this example just sleeps for a fixed amount of time for each array index and then writes a number into that array location.

Here's the code:

/******************************************************************************
* FILE: threaded_subblock_processing.c
* DESCRIPTION:
* We have a bunch of parallel processing to do and store the results in a
* large array. Let's try to use threads to speed it up.
******************************************************************************/
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <math.h>

#define BIG_ARR_LEN 10000

typedef struct thread_data{
  int start_idx;
  int end_idx;
  int id;
} thread_data_t;

int big_result_array[BIG_ARR_LEN] = {0};

void* process_sub_block(void *td)
{
   struct thread_data *current_thread_data = (struct thread_data*)td;
   printf("[%d] Hello World! It's me, thread #%d!\n", current_thread_data->id, current_thread_data->id);
   printf("[%d] I'm supposed to work on indexes %d through %d.\n", current_thread_data->id, 
       current_thread_data->start_idx, 
       current_thread_data->end_idx-1);

   for(int i=current_thread_data->start_idx; i<current_thread_data->end_idx; i++)
   {
       int retval = usleep(1000.0*1000.0*10.0/BIG_ARR_LEN);
       if(retval)
       {
         printf("sleep failed");
       }

       big_result_array[i] = i;
   }

   printf("[%d] Thread #%d done, over and out!\n", current_thread_data->id, current_thread_data->id);
   pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
   if (argc!=2)
   {
     printf("usage: ./a.out number_of_threads\n");
     return(1);
   }

   int NUM_THREADS = atoi(argv[1]);

   if (NUM_THREADS<1)
   {
     printf("usage: ./a.out number_of_threads (where number_of_threads is at least 1)\n");
     return(1);
   }

   pthread_t *threads = malloc(sizeof(pthread_t)*NUM_THREADS);
   thread_data_t *thread_data_array = malloc(sizeof(thread_data_t)*NUM_THREADS);

   int block_size = BIG_ARR_LEN/NUM_THREADS;
   for(int i=0; i<NUM_THREADS-1; i++)
   {
     thread_data_array[i].start_idx = i*block_size;
     thread_data_array[i].end_idx = (i+1)*block_size;
     thread_data_array[i].id = i;
   }
   thread_data_array[NUM_THREADS-1].start_idx = (NUM_THREADS-1)*block_size;
   thread_data_array[NUM_THREADS-1].end_idx = BIG_ARR_LEN;
   thread_data_array[NUM_THREADS-1].id = NUM_THREADS-1;  /* ids are 0-based, like the loop above */

   int ret_code;
   long t;
   for(t=0;t<NUM_THREADS;t++){
     printf("[main] Creating thread %ld\n", t);
     ret_code = pthread_create(&threads[t], NULL, process_sub_block, (void *)&thread_data_array[t]);
     if (ret_code){
       printf("[main] ERROR; return code from pthread_create() is %d\n", ret_code);
       exit(-1);
       }
     }

   printf("[main] Joining threads to wait for them.\n");
   void* status;
   for(int i=0; i<NUM_THREADS; i++)
   {
     pthread_join(threads[i], &status);
   }

   pthread_exit(NULL);
}

and I compile it with

gcc -pthread threaded_subblock_processing.c

and then I call it from the command line like so:

$ time ./a.out 4

I see a speed-up when I increase the number of threads. With 1 thread the process takes just a little over 10 seconds. This makes sense because I sleep for 1000 usec per array element, and there are 10,000 array elements. Next, when I go to 2 threads, it goes down to a little over 5 seconds, and so on.

What I don't understand is that I get a speed-up even after my number of threads exceeds the number of cores on my computer! I have 4 cores, so I was expecting no speed-up for >4 threads. But, surprisingly, when I run

$ time ./a.out 100

I get a 100x speedup and the processing completes in ~0.1 seconds! How is this possible?

Answer 1

Some general background

A program's progress can be slowed by many things, but, in general, you can divide slow spots (otherwise known as hot spots) into two categories:

CPU bound: In this case, the processor is doing some heavy number crunching (like trigonometric functions). If all the CPU's cores are engaged in such tasks, other processes must wait.
Memory bound: In this case, the processor is waiting for information to be retrieved from the hard disk or RAM. Since these are typically orders of magnitude slower than the processor, from the CPU's perspective this takes forever.

But you can also imagine other situations in which a process must wait, such as for a network response.

In many of these memory-/network-bound situations, it is possible to put a thread "on hold" while the data crawls towards the CPU and to do other useful work in the meantime. If this is done well, a multi-threaded program can easily outperform its single-threaded equivalent. Node.js makes use of such asynchronous programming techniques to achieve good performance.

Here's a handy depiction of various latencies:

Your question

Now, getting back to your question: you have multiple threads going, but they are performing neither CPU-intensive nor memory-intensive work: there's not much there to take up time. In fact, the sleep function is essentially telling the operating system that no work is being done. A sleeping thread does not occupy a core, so the OS can run other threads while yours sleep, and all 100 of your threads can be asleep at the same time on 4 cores. Each one then sleeps through its 100 array elements at about 1 ms apiece, which is roughly 0.1 seconds in total. So, naturally, the apparent performance increases significantly.
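To see this effect in isolation, here is a minimal sketch (not the asker's program). It starts a number of threads that each sleep once, then reports both the elapsed wall-clock time and the CPU time the process consumed. The names run_sleepers, sleeper and seconds_on are made up for illustration, and nanosleep is used in place of usleep; the timing relies on POSIX clock_gettime with CLOCK_MONOTONIC and CLOCK_PROCESS_CPUTIME_ID.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: read the given clock as seconds. */
static double seconds_on(clockid_t clock_id)
{
    struct timespec ts;
    clock_gettime(clock_id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Each thread only sleeps; it needs no core while it waits. */
static void *sleeper(void *arg)
{
    const struct timespec *duration = arg;
    nanosleep(duration, NULL);
    return NULL;
}

/* Hypothetical helper: run n_threads threads that each sleep for sleep_ms
 * milliseconds. Stores the elapsed wall-clock seconds and the CPU seconds
 * used by the whole process in *wall_s and *cpu_s.
 * Returns 0 on success, -1 if allocation or thread creation failed. */
int run_sleepers(int n_threads, long sleep_ms, double *wall_s, double *cpu_s)
{
    assert(n_threads > 0 && sleep_ms >= 0);
    struct timespec duration = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };
    pthread_t *threads = malloc(sizeof *threads * (size_t)n_threads);
    if (threads == NULL)
        return -1;

    double wall_start = seconds_on(CLOCK_MONOTONIC);
    double cpu_start = seconds_on(CLOCK_PROCESS_CPUTIME_ID);

    /* Start as many sleepers as possible, then wait for the ones started. */
    int created = 0;
    while (created < n_threads &&
           pthread_create(&threads[created], NULL, sleeper, &duration) == 0)
        created++;
    for (int i = 0; i < created; i++)
        pthread_join(threads[i], NULL);

    *wall_s = seconds_on(CLOCK_MONOTONIC) - wall_start;
    *cpu_s = seconds_on(CLOCK_PROCESS_CPUTIME_ID) - cpu_start;
    free(threads);
    return created == n_threads ? 0 : -1;
}
```

Calling run_sleepers(100, 100, &wall, &cpu) from a main function and printing the two values (compile with gcc -pthread) should show a wall-clock time close to a single 100 ms sleep rather than 100 of them back to back, and a CPU time that is only a small fraction of that. The exact numbers depend on the machine and the scheduler.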

Note that for low-latency applications, such as those built on MPI, busy waiting is sometimes used instead of a sleep function. In this case, the program goes into a tight loop and repeatedly checks a condition. Externally, the effect looks similar, but sleep uses no CPU while the busy wait uses ~100% of a CPU core.
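The difference between the two kinds of waiting can be sketched as follows. This is an illustration under stated assumptions, not how any particular MPI library implements its waits: the function names cpu_used_by_sleep, cpu_used_by_busy_wait and seconds_on are hypothetical, and the "condition" being polled is simply whether a deadline on CLOCK_MONOTONIC has passed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical helper: read the given clock as seconds. */
static double seconds_on(clockid_t clock_id)
{
    struct timespec ts;
    clock_gettime(clock_id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Wait by sleeping: the OS deschedules the thread until the time is up.
 * Returns the CPU seconds the process consumed during the wait. */
double cpu_used_by_sleep(long wait_ms)
{
    assert(wait_ms >= 0);
    struct timespec duration = { wait_ms / 1000, (wait_ms % 1000) * 1000000L };
    double cpu_start = seconds_on(CLOCK_PROCESS_CPUTIME_ID);
    nanosleep(&duration, NULL);
    return seconds_on(CLOCK_PROCESS_CPUTIME_ID) - cpu_start;
}

/* Wait by spinning: poll the clock in a tight loop until the deadline.
 * Returns the CPU seconds the process consumed during the wait. */
double cpu_used_by_busy_wait(long wait_ms)
{
    assert(wait_ms >= 0);
    double cpu_start = seconds_on(CLOCK_PROCESS_CPUTIME_ID);
    double deadline = seconds_on(CLOCK_MONOTONIC) + (double)wait_ms / 1000.0;
    while (seconds_on(CLOCK_MONOTONIC) < deadline) {
        /* busy wait: keep checking the condition */
    }
    return seconds_on(CLOCK_PROCESS_CPUTIME_ID) - cpu_start;
}
```

Both functions block the caller for about the same wall-clock time, but the sleeping version should report almost no CPU time, while the spinning version should report CPU time close to the full wait, provided the thread is not preempted. That is why 100 sleeping threads fit comfortably on 4 cores, whereas 100 busy-waiting threads would not.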
