
Recursive function using pthreads in C

I have the following piece of code

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXBINS 8


void swap_long(unsigned long int **x, unsigned long int **y){

  unsigned long int *tmp;
  tmp = x[0];
  x[0] = y[0];
  y[0] = tmp;

}

void swap(unsigned int **x, unsigned int **y){

  unsigned int *tmp;
  tmp = x[0];
  x[0] = y[0];
  y[0] = tmp;

}

void truncated_radix_sort(unsigned long int *morton_codes, 
              unsigned long int *sorted_morton_codes, 
              unsigned int *permutation_vector,
              unsigned int *index,
              int *level_record,
              int N, 
              int population_threshold,
              int sft, int lv){

  int BinSizes[MAXBINS] = {0};
  unsigned int *tmp_ptr;
  unsigned long int *tmp_code;

  level_record[0] = lv; // record the level of the node

  if(N<=population_threshold || sft < 0) { // Base case. The node is a leaf
    memcpy(permutation_vector, index, N*sizeof(unsigned int)); // Copy the permutation vector
    memcpy(sorted_morton_codes, morton_codes, N*sizeof(unsigned long int)); // Copy the Morton codes 

    return;
  }
  else{

    // Find which child each point belongs to 
    int j = 0;
    for(j=0; j<N; j++){
      unsigned int ii = (morton_codes[j]>>sft) & 0x07;
      BinSizes[ii]++;
    }


    // scan prefix 
    int offset = 0, i = 0;
    for(i=0; i<MAXBINS; i++){
      int ss = BinSizes[i];
      BinSizes[i] = offset;
      offset += ss;
    }

    for(j=0; j<N; j++){
      unsigned int ii = (morton_codes[j]>>sft) & 0x07;
      permutation_vector[BinSizes[ii]] = index[j];
      sorted_morton_codes[BinSizes[ii]] = morton_codes[j];
      BinSizes[ii]++;
    }

    //swap the index pointers  
    swap(&index, &permutation_vector);

    //swap the code pointers 
    swap_long(&morton_codes, &sorted_morton_codes);

    /* Call the function recursively to split the lower levels */
    offset = 0; 
    for(i=0; i<MAXBINS; i++){

      int size = BinSizes[i] - offset;

      truncated_radix_sort(&morton_codes[offset], 
               &sorted_morton_codes[offset], 
               &permutation_vector[offset], 
               &index[offset], &level_record[offset], 
               size, 
               population_threshold,
               sft-3, lv+1);
      offset += size;  
    }


  } 
}

I tried to make this block

    int j = 0;
    for(j=0; j<N; j++){
      unsigned int ii = (morton_codes[j]>>sft) & 0x07;
      BinSizes[ii]++;
    }

parallel by substituting it with the following

    int rc,j;
    pthread_t *thread = (pthread_t *)malloc(NTHREADS*sizeof(pthread_t));
    belong *belongs = (belong *)malloc(NTHREADS*sizeof(belong));
    pthread_mutex_init(&bin_mtx, NULL);
    for (j = 0; j < NTHREADS; j++){
        belongs[j].n = NTHREADS;
        belongs[j].N = N;
        belongs[j].tid = j;
        belongs[j].sft = sft;
        belongs[j].BinSizes = BinSizes;
        belongs[j].mcodes = morton_codes;
        rc = pthread_create(&thread[j], NULL, belong_wrapper, (void *)&belongs[j]);
    }

    for (j = 0; j < NTHREADS; j++){
        rc = pthread_join(thread[j], NULL);
    }

and defining these outside the recursive function

typedef struct{
    int n, N, tid, sft;
    int *BinSizes;
    unsigned long int *mcodes;
}belong;

pthread_mutex_t bin_mtx;

void * belong_wrapper(void *arg){
    int n, N, tid, sft, j;
    int *BinSizes;
    unsigned int ii;
    unsigned long int *mcodes;
    n = ((belong *)arg)->n;
    N = ((belong *)arg)->N;
    tid = ((belong *)arg)->tid;
    sft = ((belong *)arg)->sft;
    BinSizes = ((belong *)arg)->BinSizes;
    mcodes = ((belong *)arg)->mcodes;
    for (j = tid; j<N; j+=n){
        ii = (mcodes[j] >> sft) & 0x07;
        pthread_mutex_lock(&bin_mtx);
        BinSizes[ii]++;
        pthread_mutex_unlock(&bin_mtx);
    }

}

However, it takes much longer to execute than the serial version... Why is this happening? What should I change?

Since you're using a single mutex to guard updates to the BinSizes array, you're still ultimately doing all the updates to this array sequentially: only one thread can call BinSizes[ii]++ at any given time. Basically you're still executing your function in sequence but incurring the extra overhead of creating and destroying threads.

There are several options I can think of for you (there are probably more):

  1. do as @Chris suggests and make each thread update one portion of BinSizes . This might not be viable depending on the properties of the calculation you're using to compute ii .
  2. Create multiple mutexes representing different partitions of BinSizes . For example, if BinSizes has 10 elements, you could create one mutex for elements 0-4, and another for elements 5-9, then use them in your thread something like so:

         if (ii < 5) {
             mtx_index = 0;
         } else {
             mtx_index = 1;
         }
         pthread_mutex_lock(&bin_mtx[mtx_index]);
         BinSizes[ii]++;
         pthread_mutex_unlock(&bin_mtx[mtx_index]);

    You could generalize this idea to any size of BinSizes and any range: Potentially you could have a different mutex for each array element. Of course then you're opening yourself up to the overhead of creating each of these mutexes, and the possibility of deadlock if someone tries to lock several of them at once etc...

  3. Finally, you could abandon the idea of parallelizing this block altogether: as other users have mentioned, using threads this way is subject to diminishing returns. Unless your BinSizes array is very large, you might not see a huge benefit to parallelization even if you "do it right".

tl;dr - adding threads isn't a trivial fix for most problems. Yours isn't embarrassingly parallelizable, and this code has hardly any actual concurrency.


You take a mutex for every (cheap) integer operation on BinSizes . This will crush any parallelism, because all your threads are serialized on that one lock.

The few instructions you can run concurrently (the for loop and a couple of operations on the morton code array) are much cheaper than (un)locking a mutex: even using an atomic increment (if available) would be more expensive than the un-synchronized part.

One fix would be to give each thread its own output array, and combine them after all tasks are complete.

Also, you create and join multiple threads per call. Creating threads is relatively expensive compared to computation, so it's generally recommended to create a long-lived pool of them to spread that cost.

Even if you do this, you need to tune the number of threads according to how many free cores you have. If you create threads inside a recursive function, how many threads exist at the same time? Creating more threads than you have cores to schedule them on is pointless.

Oh, and you're leaking memory: the thread and belongs arrays are malloc'd on every call and never freed.
