OpenMP C program runs slower than sequential code

I am a newbie to OpenMP, trying to parallelize Jarvis's algorithm. However, it turns out that the parallel program takes 2-3 times longer than the sequential code.

Is it that the problem itself cannot be parallelized? Or is there something wrong with how I parallelized it?

This is my OpenMP program for the problem, with two parts parallelized:

#include <stdio.h>
#include <sys/time.h>
#include <omp.h>

typedef struct Point
{
    int x, y;
} Point;

// To find orientation of ordered triplet (p, q, r).
// The function returns
// 0 for collinear points
// 1 as Clockwise
// 2 as Counterclockwise
int orientation(Point p, Point i, Point q)
{
    int val = (i.y - p.y) * (q.x - i.x) -
              (i.x - p.x) * (q.y - i.y);
    if (val == 0) return 0;  // collinear
    return (val > 0)? 1: 2;  // clockwise or counterclockwise
}

// Prints convex hull of a set of n points.
void convexHull(Point points[], int n)
{
    // There must be at least 3 points
    if (n < 3) return;

    // Initialize array to store results
    Point results[n];
    int count = 0;

    // Find the leftmost point
    int l = 0, i;

    #pragma omp parallel shared (n,l) private (i)
    {
        #pragma omp for
        for (i = 1; i < n; i++)
        {
            #pragma omp critical
            {
                if (points[i].x < points[l].x)
                    l = i;
            }
        }
    }

    // Start from leftmost point, keep moving counterclockwise
    // until reaching the start point again.
    int p = l, q;
    do
    {
        // Add current point to result
        results[count] = points[p];
        count++;

        q = (p+1)%n;
        int k;

        #pragma omp parallel shared (p) private (k)
        {
            #pragma omp for
            for (k = 0; k < n; k++)
            {
                // If k is more counterclockwise than current q, then
                // update q
                #pragma omp critical
                {
                    if (orientation(points[p], points[k], points[q]) == 2)
                        q = k;
                }
            }
        }

        // Now q is the most counterclockwise with respect to p
        // Set p as q for next iteration, to add q to result
        p = q;

    } while (p != l);  // While algorithm does not return to first point

    // Print result
    int j;
    for (j = 0; j < count; j++){
        printf("(%d,%d)\n", results[j].x, results[j].y);
    }
}

int main()
{
    // declarations for start time, end time
    // and total executions for the algorithm
    struct timeval start, end;
    int i, num_run = 100;

    gettimeofday(&start, NULL);

    Point points[] = {{0, 3}, {2, 2}, {1, 1}, {2, 1},
                      {3, 0}, {0, 0}, {3, 3}};

    int n = sizeof(points)/sizeof(points[0]);

    convexHull(points, n);

    gettimeofday(&end, NULL);

    // gettimeofday has microsecond resolution, so this value is in microseconds
    int cpu_time_used = ((end.tv_sec - start.tv_sec) * 1000000) +
                        (end.tv_usec - start.tv_usec);
    printf("\n\nExecution time: %d us\n", cpu_time_used);
    return 0;
}

I tried to make the input substantial enough by adding these lines of code:

Point points[3000];
int i;
for (i = 0; i < 3000; i++) {
    points[i].x = rand() % 100;   // rand() needs <stdlib.h>
    points[i].y = rand() % 100;
    int j;
    // Reject duplicates by comparing against the points generated so far
    // (comparing forward would read entries that are still uninitialized)
    for (j = 0; j < i; j++) {
        if (points[i].x == points[j].x && points[i].y == points[j].y) {
            i--;    // regenerate this point on the next iteration
            break;
        }
    }
}

But it crashes sometimes.

In the following piece of your code, the whole content of the parallel for loop is wrapped in a critical statement. This means that this part of the code will never be entered by more than one thread at a time. Having multiple threads work one at a time will not be faster than having a single thread go through all the iterations. On top of that, some time is lost in synchronization overhead: each thread must acquire a mutex before entering the critical section and release it afterwards.

int l = 0,i;
#pragma omp parallel shared (n,l) private (i)
{
    #pragma omp for
    for (i = 1; i < n; i++)
    {
        #pragma omp critical
        {
            if (points[i].x < points[l].x)
            l = i;
        }
    }
}

The serial code needs some refactoring to be parallelized. Reduction is often a good approach for simple operations: have each thread compute a partial result over its share of the iterations (e.g. a partial minimum or a partial sum), then merge all the partial results into a global one. For supported operations, the #pragma omp for reduction(op:var) syntax can be used. But in this case, the reduction has to be done manually.
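For illustration, here is a minimal sketch (not part of the original answer) of the built-in clause for a supported operation. It reuses the question's Point struct and computes the smallest x value directly; min has been a supported reduction operator in C since OpenMP 3.1, but it only yields the value, not its index, which is why the index search below has to merge results by hand:

int min_x_value(Point points[], int n)
{
    int min_x = INT_MAX, i;   // INT_MAX from <limits.h>

    // Each thread keeps a private copy of min_x;
    // OpenMP merges the copies when the loop ends
    #pragma omp parallel for reduction(min:min_x)
    for (i = 0; i < n; i++)
        if (points[i].x < min_x)
            min_x = points[i].x;

    return min_x;
}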

See how the following code relies on thread-local variables to track the index of the minimum x, then uses a single critical section to merge them into the global minimum index.

int l = 0,i;
#pragma omp parallel shared (n,l) private (i)
{
    int l_local = 0; //This variable is private to the thread

    #pragma omp for nowait
    for (i = 1; i < n; i++)
    {
        // This part of the code can be executed in parallel
        // since all write operations are on thread-local variables
        if (points[i].x < points[l_local].x)
            l_local = i;
    }

    // The critical section is entered only once by each thread
    #pragma omp critical
    {
        if (points[l_local].x < points[l].x)
            l = l_local;
    }

    #pragma omp barrier
    // this explicit barrier is only needed if more code follows inside
    // the parallel region; otherwise the implicit barrier at the end of
    // the parallel region is enough
}

The same principle should be applied to the second parallel loop, which suffers from the same problem of being entirely serialized by the critical statement.
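For instance, a sketch of what that could look like, following the same local-candidate-plus-single-merge pattern (the q_local variable and its initialization are my own; the answer does not spell this version out):

q = (p+1)%n;
int k;

#pragma omp parallel shared (p,q) private (k)
{
    int q_local = (p + 1) % n;   // each thread's private best candidate

    #pragma omp for nowait
    for (k = 0; k < n; k++)
    {
        // Thread-local comparison, no synchronization needed
        if (orientation(points[p], points[k], points[q_local]) == 2)
            q_local = k;
    }

    // Merge step: each thread enters the critical section only once
    #pragma omp critical
    {
        if (orientation(points[p], points[q_local], points[q]) == 2)
            q = q_local;
    }
}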
