
OpenMP slows down computations

I'm trying to parallelize a simple loop using OpenMP. Below is my code:

#include <iostream>
#include <omp.h>
#include <time.h>
#define SIZE 10000000

float calculate_time(clock_t start, clock_t end) {
    return (float) ((end - start) / (double) CLOCKS_PER_SEC) * 1000;
}

void openmp_test(double * x, double * y, double * res, int threads){
    clock_t start, end;
    std::cout <<  std::endl << "OpenMP, " << threads << " threads" << std::endl;
    start = clock();
    #pragma omp parallel for num_threads(threads)
    for(int i = 0; i < SIZE; i++){
        res[i] = x[i] * y[i];
    }
    end = clock();

    for(int i = 1; i < SIZE; i++){
        res[0] += res[i];
    }
    std::cout << "time: " << calculate_time(start, end) << std::endl;
    std::cout << "result: " << res[0] << std::endl;
}

int main() {

    double *dbl_x = new double[SIZE];
    double *dbl_y = new double[SIZE];
    double *res = new double[SIZE];
    for(int i = 0; i < SIZE; i++){
        dbl_x[i] = i % 1000;
        dbl_y[i] = i % 1000;
    }

    openmp_test(dbl_x, dbl_y, res, 1);
    openmp_test(dbl_x, dbl_y, res, 1);
    openmp_test(dbl_x, dbl_y, res, 2);
    openmp_test(dbl_x, dbl_y, res, 4);
    openmp_test(dbl_x, dbl_y, res, 8);

    delete [] dbl_x;
    delete [] dbl_y;
    delete [] res;
    return 0;
}

I compile it as follows:

g++ -O3 -fopenmp main.cpp -o ompTest

However, after running the test on a Core i7, I get the following results:

OpenMP, 1 threads
time: 31.468
result: 3.32834e+12

OpenMP, 1 threads
time: 18.663
result: 3.32834e+12

OpenMP, 2 threads
time: 34.393
result: 3.32834e+12

OpenMP, 4 threads
time: 56.31
result: 3.32834e+12

OpenMP, 8 threads
time: 108.54
result: 3.32834e+12

I don't understand what I'm doing wrong. Why does OpenMP slow down the calculations?

Also, why is the first run significantly slower than the second (both with 1 OMP thread)?

My test environment: Core i7-4702MQ CPU @ 2.20GHz, Ubuntu 18.04.2 LTS, g++ 7.4.0.

Currently, you create threads, but you give them all the same job.

I think you forgot the "for" in the pragma, which is what makes the threads divide the loop iterations among themselves:

    #pragma omp parallel for num_threads(threads)

There are at least two things going on here.

  1. clock() measures elapsed processor time, which can be seen as a measure of the amount of work performed, whereas you want to measure elapsed wall time. See OpenMP time and clock() calculates two different results.

  2. Aggregate processor time should be higher in a parallel program than in a comparable serial program, because parallelization adds overhead. The more threads, the more overhead, so the speed improvement per added thread decreases with more threads, and can even become negative.

Compare to this variation on your code, which implements a more appropriate method for measuring elapsed wall time:

float calculate_time(struct timespec start, struct timespec end) {
    long long start_nanos = start.tv_sec * 1000000000LL + start.tv_nsec;
    long long end_nanos = end.tv_sec * 1000000000LL + end.tv_nsec;
    return (end_nanos - start_nanos) * 1e-6f;
}

void openmp_test(double * x, double * y, double * res, int threads){
    struct timespec start, end;
    std::cout <<  std::endl << "OpenMP, " << threads << " threads" << std::endl;
    clock_gettime(CLOCK_MONOTONIC, &start);

    #pragma omp parallel num_threads(threads)
    for(int i = 0; i < SIZE; i++){
        res[i] = x[i] * y[i];
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    for(int i = 1; i < SIZE; i++){
        res[0] += res[i];
    }
    std::cout << "time: " << calculate_time(start, end) << std::endl;
    std::cout << "result: " << res[0] << std::endl;
}

The results for me are:

OpenMP, 1 threads
time: 92.5535
result: 3.32834e+12

OpenMP, 2 threads
time: 56.128
result: 3.32834e+12

OpenMP, 4 threads
time: 59.8112
result: 3.32834e+12

OpenMP, 8 threads
time: 78.9066
result: 3.32834e+12

Note how the measured time with two threads is cut roughly in half, but adding more cores doesn't improve things much, and the trend eventually heads back toward the single-thread time.* This exhibits the competing effects of performing more work concurrently on my four-core, eight-hyperthread machine, and of the increased overhead and resource contention associated with having more threads to coordinate.

Bottom line: throwing more threads at the task does not necessarily get you the result faster, and it rarely gets you a speedup proportional to the number of threads.


* Full disclosure: I cherry-picked these particular results from among those of several runs. All showed similar trends, but the trend is particularly pronounced, and therefore probably overemphasized, in this one.
