CUDA kernel for determining primes slower than OpenMP code - how can I optimize it?

Question

To practice programming with CUDA in C++. I did an exercise which consists in displaying the prime numbers less than N. For each code I commented out the last display loop to compare only the calculation times.

The Makefile:

all: sum sum_cpu

nothing: 
    g++ -O3 -std=c++17 -o premier.exe premier.cpp -Wall
cpu:
    g++ -O3 -std=c++17 -o cpu_premier.exe premier.cpp -Wall -fopenmp  
gpu:
    nvcc  --compile --compiler-options -O3 -o gpu_premier.o gpu_premier.cu -gencode  arch=compute_50,code=sm_50
    nvcc --link --compiler-options -O3 -o gpu_premier.exe gpu_premier.o
clear: 
    rm *.exe *.o

Here is my code to parallelize with openMP which runs in 1,306s:

#include <math.h>
#include <iostream>

const int N = 2<<22;
bool * premiers;

bool est_premier(int nbr){

    if ( nbr==1 || nbr == 0) return false;
    else if (nbr == 2) return true;
    else if (nbr % 2 == 0) return false;
    else {
        for (int i=3;i<=sqrt(nbr);++i){
            if (nbr % i == 0){
                return false;
            }
        }
    }
    return true;
}


int main(){
    premiers = new bool[N+1];

    # pragma omp parallel for
    for (int i = 0;i<N;++i){
        premiers[i] = est_premier(i);
    }

    /*
    for (int i = 0;i<N;++i){
        if (premiers[i])
            std::cout<<i<<",";
       
    } std::cout<<std::endl;
    */


    delete[] premiers;
}

Here is the corresponding cuda code which runs in 1,613s:


#include <cuda.h>
#include <iostream>
const int N = 2<<22;
bool * premiers_cpu;
bool * premiers_gpu;



__device__
bool est_premier(int nbr){

    if ( nbr==1 || nbr == 0) return false;
    else if (nbr == 2) return true;
    else if (nbr % 2 == 0) return false;
    else {
        for (int i=3;i * i <= nbr ;++i){
            if (nbr % i == 0){
                return false;
            }
        }
    }
    return true;
}


__global__ void kernel_premier(bool * premiers, int size){
    int gtid =  blockIdx.x * blockDim.x  + threadIdx.x ;

    while(gtid < size){
        premiers[gtid] = est_premier(gtid);
        gtid += blockDim.x * gridDim.x;
    }


}




int main(){

    bool * premiers_cpu = new bool[N];
   

    dim3 block (256,1);
    dim3 grid (2048,1,1);

    cudaMalloc(( void **) &premiers_gpu, N * sizeof(bool));
    cudaMemcpy(premiers_gpu,premiers_cpu,N * sizeof(bool),cudaMemcpyHostToDevice);

    kernel_premier<<<grid,block>>>(premiers_gpu,N);

    cudaMemcpy(premiers_cpu,premiers_gpu,N * sizeof(bool),cudaMemcpyDeviceToHost);

   
    /*
    for (int i = 0;i<N;++i){
        if (premiers_cpu[i])
            std::cout<<i<<",";
       
    } std::cout<<std::endl;
    */

    delete[] premiers_cpu;
    cudaFree(premiers_gpu);
    


}

Intuitively, I thought that lowering the size of the grid and increasing the number of blocks per grid would make the program more efficient, but it's the opposite. Here my program in cuda is less efficient than my program with OpenMP how to explain it and how to fix it?

Answer 1

Well, first, the trivial observation of: You shouldn't even compute this at run-time. Just set the result at compile-time, for a run-time which is nothing but printing.

Also, a second semi-trivial observation: Make sure you don't time the printing. Terminal I/O is quite slow, Maybe that's what's taking most of your time? with both alternatives?

Finally, and more to the point: Divergence of threads kills your performance. Half of the threads don't even go into the loop, another 1/6 only do a single iteration - while on average, a thread iterates Theta(log((N)) times.

But addressing this is quite untrivial. Your best "choice" would be switching your algorithm altogether (perhaps even on the CPU), and using something like Erathostenes' sieve : Start with factors, not potential primes, and mark the non-primes. That's not a trivial parallelization challenge either, actually.

Anyway, with the algorith, as-is, consider a more careful separation. Say... instead of returning early for even nbr s - don't iterate them to begin with: Don't check gtid for primality, but rather 2*gtid + 1 ; and halve the size of your grid. Of course, you can do the same for the CPU - but the GPU will benefit more.

More generally, consider setting your block size to a value for which you can rule out some non-primes to begin with and only check the suspects. For example, if the block size is 5 (not a good idea. just an illustration) then you can check primality for length-10 runs of numbers? How come, Because within the sequence 10k,10k+1,10k+2,10k+3,10k+4,10k+5,10k+6,10k+7,10k+8,10k+9 - the 1'st, 3'rd, 5'th; 7'th and 9th numbers are multiples of 2. and the 6'th (and 1'th) is a multiple of 5, So, half the threads. and some reduction in the divergence. Take this further (runs of 2 3 5 values, or 2 3 5*7) with a small constant lookup table for choosing the value to check, and you will see some improvement.

Caveat: This might still not be your worst enemy. Always use profiling (eg using nsight-compute) to see where you're spending most of your time and how.

CUDA kernel for determining primes slower than OpenMP code - how can I optimize it?

Question

1 answers

solution1
2 ACCPTED 2023-01-31 18:10:52

CUDA kernel for determining primes slower than OpenMP code - how can I optimize it?

Question

1 answers

solution1 2 ACCPTED 2023-01-31 18:10:52

solution1
2 ACCPTED 2023-01-31 18:10:52