
Convert sequential loop into parallel in C using pthreads

I would like to apply a fairly simple calculation to an n-by-d array. The goal is to convert the sequential calculation into a parallel one using pthreads. My questions are: what is the optimal way to split the problem, and how can I significantly reduce the execution time of my program? I provide a sample sequential implementation in C and some thoughts on parallel implementations that I have already tried.

double *calcDistance(double *X, int n, int d)
{
    //calculate and return an array[n-1] of all the distances
    //from the last point
    double *distances = calloc(n, sizeof(double));
    for (int i = 0; i < n - 1; i++)
    {
        for (int j = 0; j < d; j++)
        {
            distances[i] += pow(X[(j + 1) * n - 1] - X[j * n + i], 2);
        }
        distances[i] = sqrt(distances[i]);
    }
    return distances;
}

I provide a main() caller function so that the sample is complete and testable:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>  //for pow() and sqrt()
#include <time.h>  //for time()

#define N 10 //00000
#define D 2        

int main()
{

    srand(time(NULL));

    //allocate the proper space for X
    double *X = malloc(D*N*(sizeof(double)));

    //fill X with numbers in space (0,1)
    for(int i = 0 ; i<N ; i++)
    {
        for(int j=0; j<D; j++)
        {
            X[i+j*N] = (double) (rand()  / (RAND_MAX + 2.0));
        }

    }
    double *distances = calcDistance(X, N, D);

    free(distances);
    free(X);
    return 0;
}
  • I have already tried utilizing pthreads asynchronously through the use of a global_index protected by a mutex and a local_index . Through the use of a while() loop, a local_index is assigned to each thread on each iteration. The local_index assignment depends on the global_index value at that time (both happening in a mutual exclusion block). The thread then executes the computation on the distances[local_index] element. Unfortunately this implementation has led to a much slower program, with a 10x to 20x longer execution time compared to the sequential one cited above.
  • Another idea is to predetermine the split of the array (say into four equal parts) and assign the computation of each segment to a given pthread . I don't know if that's a common or efficient approach, though.

Your inner loop jumps all over array X with a mixture of strides that varies with the outer-loop iteration. Unless n and d are quite small, * this is likely to produce poor cache usage -- in the serial code, too, but parallelizing would amplify that effect. At least X is not written by the function, which improves the outlook. Also, there do not appear to be any data dependencies across iterations of the outer loop, which is good.

what is the optimal way to split the problem?

Probably the best available way is to split the outer-loop iterations among your threads. For T threads, have the first perform iterations 0 ... (N / T) - 1, the second (N / T) ... (2 * N / T) - 1, etc.

How could I significantly reduce the execution time of my script?

The first thing I would do is use simple multiplication instead of pow to compute squares. It's unclear whether you stand to gain anything from parallelism.

  • I have already tried utilizing pthreads asynchronously through the use of a global_index that is imposed to mutex and a local_index. [...]

If you have to involve a mutex, semaphore, or similar synchronization object, then the task is probably hopeless. Happily (maybe) there does not appear to be any need for that. Assigning outer-loop iterations to threads dynamically is over-engineered for this problem. Statically assigning iterations to threads, as I already described, removes the need for such synchronization, and since the cost of the inner loop does not look like it will vary much across outer-loop iterations, there probably will not be much inefficiency introduced that way.

Another idea is to predetermine and split the array (say to four equal parts) and assign the computation of each segment to a given pthread. I don't know if that's a common-efficient procedure though.

This sounds like what I described. It is one of the standard scheduling models provided by OpenMP, and one of the most efficient available for many problems, given that it does not itself require a mutex. It is somewhat sensitive to the relationship between the number of threads and the number of available execution units, however. For example, if you split the work across five threads on a four-core machine, then one thread must wait until another has finished before it can run -- a best theoretical reduction in execution time of 60%. Splitting the same computation across only four threads uses the compute resources more efficiently, for a best theoretical reduction of about 75%.
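To match the thread count to the hardware rather than guessing, POSIX systems (Linux, macOS) let you query the number of online processors at run time. A small sketch, assuming sysconf(_SC_NPROCESSORS_ONLN) is available (choose_thread_count is an illustrative name):

```c
#include <unistd.h>

/* Pick a thread count equal to the number of online cores, falling
   back to a single thread if the query fails (sysconf returns -1). */
int choose_thread_count(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    return ncores < 1 ? 1 : (int)ncores;
}
```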


* If n and d are quite small, say anything remotely close to the values in the example driver program, then the overhead arising from parallelization has a good chance of overcoming any gains from parallel execution.
