
O3 optimization flag making speed-ups worse in parallel processing

I am testing speed-ups for a parallel program in C using OpenMP. When I compile the code with gcc's -O3 flag, the execution time is much smaller. However, compared to the code compiled without optimization flags, I consistently get lower speed-ups for the different thread counts (2, 4, 8, 16, 24). How is this possible?

Here is more info about what I've found so far. I am writing code to find prime numbers based on the Sieve of Eratosthenes, and I am trying to optimize it with a parallel version using OpenMP. Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h> 
#include <math.h> 

// ind2num: returns the integer (3<=odd<=numMax)
//      represented by index i at prime_numbers (0<=i<=maxInd)
#define ind2num(i)  (2*(i)+3)
// num2ind: returns the index (0<=i<=maxInd) at prime_numbers
//      which represents the number (3<=odd<=numMax)
#define num2ind(i)  (((i)-3)/2)

// Sieve: find all prime numbers until ind2num(maxInd)
void Sieve(int *prime_numbers, long maxInd) {
    long maxSqrt;
    long baseInd;
    long base;
    long i;

    // square root of the largest integer (largest possible prime factor)
    maxSqrt = (long) sqrt((long) ind2num(maxInd));

    // first base
    baseInd=0;
    base=3;

    do {
        // marks as non-prime all multiples of base starting at base^2
        #pragma omp parallel for schedule (static)
        for (i=num2ind(base*base); i<=maxInd; i+=base) {
            prime_numbers[i]=0;
        }

        // updates base to next prime number
        for (baseInd=baseInd+1; baseInd<=maxInd; baseInd++)
        if (prime_numbers[baseInd]) {
                base = ind2num(baseInd);
                break;
            }
    }
    while (baseInd <= maxInd && base <= maxSqrt);

}
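For completeness, here is a minimal driver sketch for timing Sieve(). It is not part of the question; the array size, the choice of one int flag per odd number, and the timing with omp_get_wtime() are assumptions about how the measurements below could have been taken. It can be compiled together with the function above using something like gcc -fopenmp [-O3] sieve.c -lm.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Sieve() as defined above
void Sieve(int *prime_numbers, long maxInd);

int main(void) {
    long numMax = 1000000000L;              // upper bound, 10^9 as in the question
    long maxInd = (numMax - 3) / 2;         // index of the largest odd number <= numMax
    long i, count;
    double start, elapsed;

    // one flag per odd number 3..numMax (~2 GB for numMax = 10^9), initially marked prime
    int *prime_numbers = malloc((maxInd + 1) * sizeof(int));
    if (prime_numbers == NULL) return 1;
    for (i = 0; i <= maxInd; i++)
        prime_numbers[i] = 1;

    start = omp_get_wtime();
    Sieve(prime_numbers, maxInd);
    elapsed = omp_get_wtime() - start;

    // count the surviving flags; +1 accounts for the even prime 2
    count = 1;
    for (i = 0; i <= maxInd; i++)
        count += prime_numbers[i];

    printf("%ld primes below %ld in %.2f s (%d threads)\n",
           count, numMax, elapsed, omp_get_max_threads());
    free(prime_numbers);
    return 0;
}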

If I run it to find all prime numbers smaller than 1000000000 (10^9), for example, I end up with the following execution times for different numbers of threads (1, 2, 4, 8, 16, 24):

Threads     | 1      | 2      | 4      | 8      | 16    | 24
Without -O3 | 56.31s | 28.87s | 21.77s | 11.19s | 6.13s | 4.50s
With -O3    | 10.10s | 5.23s  | 3.74s  | 2.81s  | 2.62s | 2.52s

Here are the corresponding speed-ups:

Threads     | 1 | 2    | 4    | 8    | 16   | 24
Without -O3 | 1 | 1.95 | 2.59 | 5.03 | 9.19 | 12.51
With -O3    | 1 | 1.93 | 2.70 | 3.59 | 3.85 | 4.01

How come I keep getting lower speed-ups with the -O3 flag?

Executing this algorithm requires a certain amount of memory bandwidth. The less optimized the code, the more the run time is dominated by work inside the CPU; the more optimized the code, the more the run time is dominated by memory speed.

Since unoptimized code is less efficient, more cores can run it before the system's memory bandwidth gets saturated. Since the optimized code is more efficient, it gets its memory accesses done faster and thus puts a heavier load on the memory bandwidth. This makes it less parallelizable.
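To see the bandwidth ceiling independently of the sieve, one can time a loop that does little besides write memory. The sketch below is illustrative and not from the question; the array size, the thread counts, and the warm-up pass are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

int main(void) {
    long n = 500000000L;                        // ~2 GB of int flags; shrink to fit your RAM
    int thread_counts[] = {1, 2, 4, 8, 16, 24};
    int *flags = malloc(n * sizeof(int));
    if (flags == NULL) return 1;

    // touch the pages once so the timed runs measure bandwidth, not page faults
    memset(flags, 1, n * sizeof(int));

    for (int t = 0; t < 6; t++) {
        omp_set_num_threads(thread_counts[t]);
        double start = omp_get_wtime();

        // almost pure store traffic, like the sieve's marking loop:
        // limited by memory bandwidth rather than by the CPU
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            flags[i] = 0;

        printf("%2d threads: %.3f s\n", thread_counts[t], omp_get_wtime() - start);
    }

    free(flags);
    return 0;
}

Compiled with gcc -fopenmp, with and without -O3, the per-iteration work shrinks under -O3, so the bandwidth limit is typically reached with fewer threads and the speed-up curve flattens, much like the pattern in the tables above.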
