
Recursive function using pthreads in C

I have the following piece of code

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

#define MAXBINS 8


void swap_long(unsigned long int **x, unsigned long int **y){

  unsigned long int *tmp;
  tmp = x[0];
  x[0] = y[0];
  y[0] = tmp;

}

void swap(unsigned int **x, unsigned int **y){

  unsigned int *tmp;
  tmp = x[0];
  x[0] = y[0];
  y[0] = tmp;

}

void truncated_radix_sort(unsigned long int *morton_codes, 
              unsigned long int *sorted_morton_codes, 
              unsigned int *permutation_vector,
              unsigned int *index,
              int *level_record,
              int N, 
              int population_threshold,
              int sft, int lv){

  int BinSizes[MAXBINS] = {0};
  unsigned int *tmp_ptr;
  unsigned long int *tmp_code;

  level_record[0] = lv; // record the level of the node

  if(N<=population_threshold || sft < 0) { // Base case. The node is a leaf
    memcpy(permutation_vector, index, N*sizeof(unsigned int)); // Copy the permutation vector
    memcpy(sorted_morton_codes, morton_codes, N*sizeof(unsigned long int)); // Copy the Morton codes 

    return;
  }
  else{

    // Find which child each point belongs to 
    int j = 0;
    for(j=0; j<N; j++){
      unsigned int ii = (morton_codes[j]>>sft) & 0x07;
      BinSizes[ii]++;
    }


    // scan prefix 
    int offset = 0, i = 0;
    for(i=0; i<MAXBINS; i++){
      int ss = BinSizes[i];
      BinSizes[i] = offset;
      offset += ss;
    }

    for(j=0; j<N; j++){
      unsigned int ii = (morton_codes[j]>>sft) & 0x07;
      permutation_vector[BinSizes[ii]] = index[j];
      sorted_morton_codes[BinSizes[ii]] = morton_codes[j];
      BinSizes[ii]++;
    }

    //swap the index pointers  
    swap(&index, &permutation_vector);

    //swap the code pointers 
    swap_long(&morton_codes, &sorted_morton_codes);

    /* Call the function recursively to split the lower levels */
    offset = 0; 
    for(i=0; i<MAXBINS; i++){

      int size = BinSizes[i] - offset;

      truncated_radix_sort(&morton_codes[offset], 
               &sorted_morton_codes[offset], 
               &permutation_vector[offset], 
               &index[offset], &level_record[offset], 
               size, 
               population_threshold,
               sft-3, lv+1);
      offset += size;  
    }


  } 
}

I tried to make this block

    int j = 0;
    for(j=0; j<N; j++){
      unsigned int ii = (morton_codes[j]>>sft) & 0x07;
      BinSizes[ii]++;
    }

parallel by substituting it with the following

    int rc,j;
    pthread_t *thread = (pthread_t *)malloc(NTHREADS*sizeof(pthread_t));
    belong *belongs = (belong *)malloc(NTHREADS*sizeof(belong));
    pthread_mutex_init(&bin_mtx, NULL);
    for (j = 0; j < NTHREADS; j++){
        belongs[j].n = NTHREADS;
        belongs[j].N = N;
        belongs[j].tid = j;
        belongs[j].sft = sft;
        belongs[j].BinSizes = BinSizes;
        belongs[j].mcodes = morton_codes;
        rc = pthread_create(&thread[j], NULL, belong_wrapper, (void *)&belongs[j]);
    }

    for (j = 0; j < NTHREADS; j++){
        rc = pthread_join(thread[j], NULL);
    }

and defining these outside the recursive function

typedef struct{
    int n, N, tid, sft;
    int *BinSizes;
    unsigned long int *mcodes;
}belong;

pthread_mutex_t bin_mtx;

void * belong_wrapper(void *arg){
    int n, N, tid, sft, j;
    int *BinSizes;
    unsigned int ii;
    unsigned long int *mcodes;
    n = ((belong *)arg)->n;
    N = ((belong *)arg)->N;
    tid = ((belong *)arg)->tid;
    sft = ((belong *)arg)->sft;
    BinSizes = ((belong *)arg)->BinSizes;
    mcodes = ((belong *)arg)->mcodes;
    for (j = tid; j<N; j+=n){
        ii = (mcodes[j] >> sft) & 0x07;
        pthread_mutex_lock(&bin_mtx);
        BinSizes[ii]++;
        pthread_mutex_unlock(&bin_mtx);
    }

    return NULL; // a pthread start routine returns void *, so return a value
}

However, it takes a lot more time than the serial version to execute. Why is this happening? What should I change?

Since you're using a single mutex to guard updates to the BinSizes array, you're still ultimately doing all the updates to this array sequentially: only one thread can execute BinSizes[ii]++ at any given time. Basically you're still executing your function in sequence, but incurring the extra overhead of creating and destroying threads.

There are several options I can think of for you (there are probably more):

  1. Do as @Chris suggests and make each thread update one portion of BinSizes. This might not be viable depending on the properties of the calculation you're using to compute ii.
  2. Create multiple mutexes representing different partitions of BinSizes. For example, if BinSizes has 10 elements, you could create one mutex for elements 0-4 and another for elements 5-9, then use them in your thread something like so:

     if (ii < 5) {
         mtx_index = 0;
     } else {
         mtx_index = 1;
     }
     pthread_mutex_lock(&bin_mtx[mtx_index]);
     BinSizes[ii]++;
     pthread_mutex_unlock(&bin_mtx[mtx_index]);

    You could generalize this idea to any size of BinSizes and any range: potentially you could have a different mutex for each array element. Of course, you then take on the overhead of creating each of these mutexes, and the possibility of deadlock if someone tries to lock several of them at once, etc.

  3. Finally, you could abandon the idea of parallelizing this block altogether: as other users have mentioned, using threads this way is subject to some level of diminishing returns. Unless your BinSizes array is very large, you might not see a huge benefit to parallelization even if you "do it right".

tl;dr - adding threads isn't a trivial fix for most problems. Yours isn't embarrassingly parallelizable, and this code has hardly any actual concurrency.


You lock and unlock a mutex for every (cheap) integer increment on BinSizes. This will crush any parallelism, because all your threads are serialized on it.

The few instructions you can run concurrently (the for loop and a couple of operations on the Morton code array) are much cheaper than locking and unlocking a mutex: even an atomic increment (if available) would be more expensive than the unsynchronized part.

One fix would be to give each thread its own output array, and combine them after all tasks are complete.

Also, you create and join multiple threads per call. Creating threads is relatively expensive compared to computation, so it's generally recommended to create a long-lived pool of them to spread that cost.

Even if you do this, you need to tune the number of threads according to how many (free) cores you have. If you do this in a recursive function, how many threads exist at the same time? Creating more threads than you have cores to schedule them on is pointless.

Oh, and you're leaking memory.
