O3 optimization flag making speed-ups worse in parallel processing

I am measuring speed-ups for a parallel C program that uses OpenMP. When I compile the code with gcc and the -O3 flag, the execution times are much smaller. However, the speed-ups I get for different thread counts (2, 4, 8, 16, 24) are consistently lower than those of the code compiled without optimization flags. How is this possible?

Here is more information about what I have found so far. I am writing code to find prime numbers based on the Sieve of Eratosthenes, and I am trying to optimize it with a parallel version using OpenMP. Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h> 
#include <math.h> 

// ind2num: returns the integer (3<=odd<=numMax)
//      represented by index i at prime_numbers (0<=i<=maxInd)
#define ind2num(i)  (2*(i)+3)
// num2ind: returns the index (0<=i<=maxInd) at prime_numbers
//      which represents the number (3<=odd<=numMax)
#define num2ind(i)  (((i)-3)/2)

// Sieve: find all prime numbers until ind2num(maxInd)
void Sieve(int *prime_numbers, long maxInd) {
    long maxSqrt;
    long baseInd;
    long base;
    long i;

    // square root of the largest integer (largest possible prime factor)
    maxSqrt = (long) sqrt((long) ind2num(maxInd));

    // first base
    baseInd=0;
    base=3;

    do {
        // marks as non-prime all multiples of base starting at base^2
        #pragma omp parallel for schedule (static)
        for (i=num2ind(base*base); i<=maxInd; i+=base) {
            prime_numbers[i]=0;
        }

        // updates base to next prime number
        for (baseInd=baseInd+1; baseInd<=maxInd; baseInd++)
            if (prime_numbers[baseInd]) {
                base = ind2num(baseInd);
                break;
            }
    }
    while (baseInd <= maxInd && base <= maxSqrt);

}

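A minimal driver along these lines can be used to reproduce the timings below (the allocation size, the plain initialization loop, and the omp_get_wtime() timing are assumptions here; the original driver is not shown):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

// Prototype for the Sieve() defined above.
void Sieve(int *prime_numbers, long maxInd);

int main(void) {
    const long numMax = 1000000000L;       // search limit (10^9)
    const long maxInd = (numMax - 3) / 2;  // index of the largest odd number below the limit

    int *prime_numbers = malloc((maxInd + 1) * sizeof *prime_numbers);
    if (prime_numbers == NULL) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    // Initially assume every odd number >= 3 is prime.
    for (long i = 0; i <= maxInd; i++)
        prime_numbers[i] = 1;

    double t0 = omp_get_wtime();
    Sieve(prime_numbers, maxInd);
    double t1 = omp_get_wtime();
    printf("threads=%d  time=%.2fs\n", omp_get_max_threads(), t1 - t0);

    free(prime_numbers);
    return 0;
}

The two builds would then be something like gcc -fopenmp sieve.c -lm versus gcc -O3 -fopenmp sieve.c -lm, with the thread count selected through OMP_NUM_THREADS.
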
If I execute it to find all prime numbers smaller than 1000000000 (10^9), for example, I end up with the following execution times for different numbers of threads (1, 2, 4, 8, 16, 24):

Threads     |   1    |   2    |   4    |   8    |  16   |  24   |
Without -O3 | 56.31s | 28.87s | 21.77s | 11.19s | 6.13s | 4.50s |
With -O3    | 10.10s |  5.23s |  3.74s |  2.81s | 2.62s | 2.52s |

Here are the corresponding speed-ups:

Threads     |  1  |  2   |  4   |  8   |  16  |  24   |
Without -O3 |  1  | 1.95 | 2.59 | 5.03 | 9.19 | 12.51 |
With -O3    |  1  | 1.93 | 2.70 | 3.59 | 3.85 | 4.01  |

How come I keep getting lower speed-ups with the -O3 flag?

There's a certain amount of memory bandwidth that the execution of the algorithm is going to require. The less optimized the code, the more internal CPU machinations dominate the run time. The more optimized the code, the more memory speed dominates the run time.

Since unoptimized code is less efficient, more cores can run it before system memory bandwidth gets saturated. Since the optimized code is more efficient, it issues its memory accesses faster and thus puts a heavier load on memory bandwidth, so it saturates the bus with fewer cores and stops scaling sooner.
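
The same effect can be reproduced with a loop that does nothing but stream data through memory. The sketch below is an illustrative example rather than anything from the question: compiled with -O3 it boils down to little more than streaming stores, so a few cores already saturate the memory bus, while the unoptimized build spends extra instructions per iteration and needs more cores to reach the same point.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1L << 28)   // 2^28 ints, about 1 GiB of stores per sweep

int main(void) {
    int *a = malloc(N * sizeof *a);
    if (a == NULL)
        return EXIT_FAILURE;

    // Touch every page once so page faults are not part of the timed sweep.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 1;

    double t0 = omp_get_wtime();
    // Almost no arithmetic per iteration: the run time is dominated by how
    // fast the stores can be pushed out to main memory.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0;
    double t1 = omp_get_wtime();

    printf("threads=%d  time=%.3fs  ~%.1f GB/s\n",
           omp_get_max_threads(), t1 - t0,
           N * sizeof *a / (t1 - t0) / 1e9);

    free(a);
    return 0;
}

Running it with increasing OMP_NUM_THREADS shows the reported GB/s figure flattening out once the memory controllers are saturated; the -O3 build typically hits that ceiling at a lower thread count, which is the same pattern as in the timing tables above.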
