
Problems in parallelizing radix sort in C with CUDA

I am trying to implement a radix sort algorithm, as an exercise, in C with CUDA so that I can parallelize it; the code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <cuda_profiler_api.h>

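// Fill the array with the values n down to 1, i.e. in reverse-sorted order.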
int* populateArray(int * arr, int n){
    int j=0;
    for (int i=n; i>0 ; i--){
        arr[j]= i;
        j=j+1;
    }
    return arr;
}

void printArray(int * array, int size){
    printf("Ordered array:\n");
    for (int j=0; j<size; j++){
        printf("%d ", array[j]);
    }
    printf("\n");
}

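// Serial scan for the largest value; it determines how many digit passes the sort needs.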
__device__  int getMax(int* array, int n) {
  int max = array[0];
  for (int i = 1; i < n; i++)
    if (array[i] > max)
      max = array[i];
  return max;
}

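// One counting-sort pass on the given decimal digit; output is scratch space the same size as array.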
__device__ void countingSort(int* array,int size,int digit, int index, int* output) {
  int count[10]={0};

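  // 1) Histogram: count how many elements have each value (0-9) of the current digit.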
  for (int i = 0; i < size; i++)
    count[(array[i] / digit) % 10]++;

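  // 2) Prefix sums: count[d] becomes the number of elements whose digit is <= d.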
  for (int i = 1; i < 10; i++)
    count[i] += count[i - 1];

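  // 3) Stable scatter: walk backwards and place each element at its final position for this digit.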
  for (int i = size - 1; i >= 0; i--) {
    output[count[(array[i] /digit) % 10] - 1] = array[i];
    count[(array[i] / digit) % 10]--;
  }

  for (int i = 0; i < size; i++)
    array[i]= output[i];
}


__global__ void radixsort(int* array, int size, int* output) {
  // define index
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  
  // check that the thread is not out of the vector boundary
  if (index >= size) return;

  int max = getMax(array, size);
  for (int digit = 1; max / digit > 0; digit *= 10){
    countingSort(array, size, digit, index, output);
  }
 
}

int main(int argc, char *argv[]) {
    // Init array
    int n = 1000;
    int* array_h = (int*) malloc(sizeof(int) * n);
    populateArray(array_h, n);
    
    // allocate memory on device
    int* array_dev;
    cudaMalloc((void**)&array_dev, n*sizeof(int));
    int* output;
    cudaMalloc((void**)&output, n*sizeof(int));

    // copy data from host to device
    cudaMemcpy(array_dev, array_h, n*sizeof(int), cudaMemcpyHostToDevice);

    dim3 block(32);
    dim3 grid((n-1)/block.x + 1);
    printf("Number of threads for each block: %d\n", block.x);
    printf("Number of blocks in the grid: %d\n", grid.x);

    // Create start and stop CUDA events 
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);

    cudaError_t mycudaerror;
    mycudaerror = cudaGetLastError();
    // define the execution configuration
    radixsort<<<grid,block>>>(array_dev, n, output);

    // device synchronization and cudaGetLastError call
    cudaDeviceSynchronize();
    mycudaerror = cudaGetLastError();
    if(mycudaerror != cudaSuccess)  {
      fprintf(stderr,"%s\n",cudaGetErrorString(mycudaerror)) ;
      //printf("Error in kernel!");
      exit(1);
    }

    // event record, synchronization, elapsed time and destruction
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float elapsed;
    cudaEventElapsedTime(&elapsed, start, stop);
    elapsed = elapsed/1000.f; // convert to seconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    printf("Number of elements in the array: %d\n", n);
    printf("Kernel elapsed time: %.5f s\n", elapsed);

    // copy back results from device
    cudaMemcpy(array_h, array_dev, n*sizeof(int), cudaMemcpyDeviceToHost);

    // print ordered array
    printArray(array_h, n);
    
    // free resources on device
    cudaFree(array_dev);
    cudaFree(output);

    // free resources on host
    free(array_h);

    cudaProfilerStop();
    return 0; 
}

In order to run it, I'm using Google Colab. The number of threads per block is fixed at 32 (the block variable), while the number of blocks is computed in main from how many elements need to be sorted (the grid variable).

The problem arises when I start to change the number of elements in the array to be sorted (the variable "n" in main): once a certain threshold is exceeded, the sorting is no longer performed correctly.

In order to get more information about this incorrect execution, I also used cuda-memcheck and compiled with nvcc -lineinfo, and I found that the error is caused by this line of code in the kernel:

output[count[(array[i] /digit) % 10] - 1] = array[i];

Making several attempts, the error seems to lie mainly in the computation of the index of the "output" array I am trying to write to; however, the error does not occur when, for example, I try to sort 32 or 64 elements. I would therefore like to know whether I am doing something wrong in the code or whether radix sort simply cannot be parallelized the way I am attempting. I know that having each thread perform the entire sort, without using each thread's own index, is computationally very heavy, but first I wanted to solve this problem and then try to optimize the code in general.

Various approaches I have already tried include:

  • use of atomic instructions (a rough sketch of what this could look like for the counting step is shown right after this list);
  • declaring the "output" array inside the kernel instead of passing it as a pointer from main;
  • using only one block of threads to compute the sorted array (an approach that strangely seems to work, but which is useless given the dynamic allocation of the number of blocks).
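For context, the sketch below shows what the "use of atomic instructions" idea could look like for just the counting step of one pass; the kernel name countDigits and the grid-stride structure are placeholders, not taken from my actual attempt, and count is assumed to be a 10-element device buffer zeroed (for example with cudaMemset) before the launch. Note that this only parallelizes the histogram: the scatter into "output" is still order-dependent across threads.

// Hypothetical sketch: many threads build the digit histogram of one pass,
// using atomicAdd so that concurrent increments are not lost.
// count must point to 10 device ints zeroed before the launch.
__global__ void countDigits(const int* array, int size, int digit, int* count)
{
    int index  = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop: each thread handles a strided subset of the elements.
    for (int i = index; i < size; i += stride)
        atomicAdd(&count[(array[i] / digit) % 10], 1);
}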

Thanks for any replies.

You seem to already understand that this is not the way to approach this problem:

I know that having each thread perform the entire sort, without using each thread's own index, is computationally very heavy, but first I wanted to solve this problem and then try to optimize the code in general.

Nearly any code that is serial in nature can be dropped into a CUDA kernel, run in a single thread, and it should produce the same result. However, this is not the way to write CUDA code; the performance will be dismal.

In many cases, algorithms that are serial in nature are not readily adaptable to parallelization. Your sorting approach is one of them. A typical method to (efficiently) parallelize radix sort is here.

Nevertheless, dropping a serial algorithm into a single thread "should" work. The problem arises here when you run the same serial code (unnecessarily) in many threads.

If all threads were perfectly in lockstep, they would all be doing exactly the same thing, in the same instruction or clock cycle. However, GPUs don't work that way. About the largest "lockstep" you can witness is at the warp level (32 threads in a threadblock). As soon as you go to multiple warps, whether they are in the same threadblock or different threadblocks, you will have threads at different points in your serial code, effectively stepping on each other.

A simple proof-point for this is to change your grid configuration to a single thread:

dim3 block(1);
dim3 grid(1);

Then the errors disappear, even for n = 1000, and the array is properly sorted. Because of the typical lockstep nature of a single warp, this should also work:

dim3 block(32);
dim3 grid(1);

Certain larger configurations might work in some cases, but with a large enough number of blocks, eventually not all threads will start at the same time, and this leads to trouble.

An alternate solution is to use a single thread with a most-significant-digit radix sort to separate the array into multiple bins, such as 256 bins if using base 256. After this initial step, each of the bins can be sorted independently and in parallel using a conventional least-significant-digit-first radix sort. For this to work well, the data needs to be reasonably uniform (at least for the most significant digit) so that the multiple bins are somewhat similar in size.

Trying to do the initial step in parallel is complicated.
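As a rough illustration only, here is a sketch under a few stated assumptions: the keys are non-negative 32-bit ints, so the most significant "digit" can be taken as the top byte (base 256); the identifiers (msdPartition, sortBins, binStart, binFill, binCount, scratch) are placeholders of mine, not code from the question; and the per-bin histogram, its exclusive prefix sum (binStart) and a zero-initialized binFill buffer are assumed to have been prepared beforehand, for example on the host.

#define BINS 256

// Phase 1: a single thread scatters the input into 256 bins keyed on the most
// significant byte. binStart[b] holds the exclusive prefix sum of the bin
// sizes; binFill must be zero-initialized before the launch.
__global__ void msdPartition(const int* in, int* out, int n,
                             const int* binStart, int* binFill)
{
    if (blockIdx.x != 0 || threadIdx.x != 0) return;   // run in one thread only
    for (int i = 0; i < n; i++) {
        int b = (in[i] >> 24) & 0xFF;
        out[binStart[b] + binFill[b]++] = in[i];
    }
}

// Phase 2: one thread per bin. Each thread runs an ordinary serial LSD
// (least-significant-digit-first) counting sort on its own bin, so the bins
// are sorted independently and in parallel. scratch is a buffer the same size
// as data, playing the role of "output" in the original code.
__global__ void sortBins(int* data, const int* binStart, const int* binCount,
                         int* scratch)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= BINS) return;

    int* bin = data + binStart[b];
    int* tmp = scratch + binStart[b];
    int  len = binCount[b];

    // The largest value in this bin decides how many decimal passes are needed.
    int mx = (len > 0) ? bin[0] : 0;
    for (int i = 1; i < len; i++)
        if (bin[i] > mx) mx = bin[i];

    // long long keeps digit from overflowing on the last pass for large keys.
    for (long long digit = 1; mx / digit > 0; digit *= 10) {
        int count[10] = {0};
        for (int i = 0; i < len; i++)
            count[(bin[i] / digit) % 10]++;
        for (int i = 1; i < 10; i++)
            count[i] += count[i - 1];
        for (int i = len - 1; i >= 0; i--)
            tmp[--count[(bin[i] / digit) % 10]] = bin[i];
        for (int i = 0; i < len; i++)
            bin[i] = tmp[i];
    }
}

With a launch such as sortBins<<<(BINS + 31) / 32, 32>>>(data, binStart, binCount, scratch), each bin is handled by exactly one thread, and every thread touches only its own slice of data, so threads no longer step on each other. Because the top byte already orders the bins relative to one another, the in-place concatenation of the sorted bins is a fully sorted array.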
