CUDA - Parallel Reduction Sum of Even and Odd Numbers Separately

I am trying to implement a parallel reduction in CUDA that sums the even and odd numbers separately. I'm new to CUDA programming, and despite trying hard I can't find a solution.

For example, I have the array [5, 8, 0, -6, 2]. The result needs to be [4, 5] (even: 8+0-6+2=4, odd: 5=5), but my code below produces [8, 5].

I think my problem is with the notion of "shared" memory, but I do not understand why.

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void sumEvenOdd(int *a, int *b, int N){
    int column = blockIdx.x * blockIdx.x + threadIdx.x;

    __shared__ int s_data[2];

    if (column < N){ 
        if (a[column] % 2 == 0){
            s_data[0] += a[column];
        }
        else{
            s_data[1] += a[column];
        }
        __syncthreads();
        b[0] = s_data[0];
        b[1] = s_data[1];
    }
}

void initArray(int *a, int N){
    for (unsigned int i = 0; i < N; i++){
        a[i] = rand() % 100;
    }
}

void verify_result(int *a, int *b, int N){
    int *verify_b;
    verify_b = (int*)malloc(2 * sizeof(int));
    verify_b[0] = 0;
    verify_b[1] = 0;
    for (unsigned int i = 0; i < N; i++){
        if (a[i] % 2 == 0){
            verify_b[0] += a[i]; 
        }
        else{
            verify_b[1] += a[i];
        }
    }
    for (unsigned int i = 0; i < 2; i++){
        assert(verify_b[i] == b[i]);
    }
}

void printResult(int *a, int *b, int N){
    printf("\n");
    for (unsigned int i = 0; i < N; i++){
        printf("%d, ", a[i]);
    }
    printf("\n");
    for (unsigned int i = 0; i < 2; i++){
        printf("%d, ", b[i]);
    }
}

int main(){
 
    //Array sizes;
    int N = 5;
        
    //Size (in bytes) of matrix
    size_t bytes = N * sizeof(int);

    //Host pointers
    int *a, *b;
    
    // Allocate host memory
    a = (int*)malloc(bytes);
    b = (int*)malloc(2 * sizeof(int));

    // Initialize array
    initArray(a, N);

    // Device pointers
    int *d_a, *d_b;

    // Allocated device memory
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, 2 * sizeof(int));

    // Copy data to the device
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);

    //Number of threads
    int THREADS = 128;

    //Number of blocks
    int BLOCKS = (N + THREADS - 1) / THREADS;

    // Launch kernel
    sumEvenOdd<<<BLOCKS, THREADS>>>(d_a, d_b, N);
    cudaDeviceSynchronize();

    // Copy back to the host
    cudaMemcpy(b, d_b, 2 * sizeof(int), cudaMemcpyDeviceToHost);

    // Check result
    verify_result(a, b, N);

    printResult(a, b, N);

    return 0;
}

You cannot just use

s_data[1] += a[column];

Remember that all threads execute this line at the same time and write to the same location. Each `+=` is a separate read, add, and write, so the threads race on s_data and overwrite one another's updates.

Instead, you should use an atomic add:

atomicAdd(&s_data[1], a[column]);

You should also initialize s_data to zero, because shared memory is not zeroed for you.
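Putting those points together, a corrected kernel might look like the sketch below. It is one possible arrangement, not the only one. The names `sumEvenOddAtomic` and `sumEvenOddHost` are hypothetical and introduced only for illustration. The sketch assumes the output array is zeroed before the launch, and it also computes the global index with `blockDim.x` (the question's code multiplies `blockIdx.x` by itself, which only goes unnoticed because a single block is launched).

```cuda
#include <cuda_runtime.h>

// Sketch of a corrected kernel: b[0] accumulates the even sum, b[1] the odd sum.
// Assumption: b[0] and b[1] were set to zero before the launch.
__global__ void sumEvenOddAtomic(const int *a, int *b, int N) {
    // Global index uses blockDim.x, not blockIdx.x twice.
    int column = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ int s_data[2];

    // Shared memory starts uninitialized, so one thread per block zeroes it.
    if (threadIdx.x == 0) {
        s_data[0] = 0;
        s_data[1] = 0;
    }
    __syncthreads();

    if (column < N) {
        // Compare the remainder with 0 so negative odd values (remainder -1) count as odd.
        int slot = (a[column] % 2 == 0) ? 0 : 1;
        atomicAdd(&s_data[slot], a[column]);
    }
    // Kept outside the bounds check so every thread in the block reaches the barrier.
    __syncthreads();

    // One thread per block adds the block's partial sums to the global result.
    if (threadIdx.x == 0) {
        atomicAdd(&b[0], s_data[0]);
        atomicAdd(&b[1], s_data[1]);
    }
}

// Hypothetical host helper: writes the even sum to out[0] and the odd sum to out[1].
void sumEvenOddHost(const int *h_a, int N, int out[2]) {
    int *d_a, *d_b;
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMalloc(&d_b, 2 * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_b, 0, 2 * sizeof(int));  // zero the global accumulators

    int threads = 128;
    int blocks = (N + threads - 1) / threads;
    sumEvenOddAtomic<<<blocks, threads>>>(d_a, d_b, N);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out, d_b, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
}
```

Calling `sumEvenOddHost` on the array from the question, [5, 8, 0, -6, 2], should give [4, 5]. Accumulating in shared memory first keeps most of the contention inside each block. Only one thread per block touches the global counters, which matters once N grows beyond a single block.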
