Parallel radix sort algorithm in C with OpenMP

Question

I'm trying to parallelize the following Radix Sort algorithm C code using OpenMP but I have some doubts about using the OpenMP clauses. In particular, there are some loops where I doubt that they can be parallelized at all.

Here is the code I'm working on:

unsigned getMax(size_t n, unsigned arr[n]) {
    unsigned mx = arr[0];
    unsigned i;
    #pragma omp parallel for reduction(max:mx) private(i)
    for (i = 1; i < n; i++)
        if (arr[i] > mx)
            mx = arr[i];
    return mx;
}
 
void countSort(size_t n, unsigned arr[n], unsigned exp) {
    unsigned output[n]; // output array
    int i, count[10] = { 0 };
    // Store count of occurrences in count[]
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        count[(arr[i] / exp) % 10]++; }
 
    for (i = 1; i < 10; i++)
        count[i] += count[i - 1];
 
    // Build the output array
    #pragma omp parallel for private(i)
    for (i = (int) n - 1; i >= 0; i--) {
        #pragma omp atomic write
        output[count[(arr[i] / exp) % 10] - 1] = arr[i];
        count[(arr[i] / exp) % 10]--;
    }
    
    #pragma omp parallel for private(i)
    for (i = 0; i < n; i++)
        arr[i] = output[i];
}
 
// The main function that sorts arr[] of size n using Radix Sort
void radixsort(size_t n, unsigned arr[n], int threads) {
    omp_set_num_threads(threads);
    unsigned m = getMax(n, arr);
    unsigned exp;
    for (exp = 1; m / exp > 0; exp *= 10)
        countSort(n, arr, exp);
}

In particular, I'm not sure if for loops like the following can be parallelized or not:

    for (i = 1; i < 10; i++)
            count[i] += count[i - 1];

    #pragma omp parallel for private(i)
    for (i = (int) n - 1; i >= 0; i--) {
        #pragma omp atomic write
        output[count[(arr[i] / exp) % 10] - 1] = arr[i];
        count[(arr[i] / exp) % 10]--;
    }

I'm asking for help on the specific OMP clauses I should use; other comments on the code shown are also welcome.

Answer 1

First of all, parallelizing code only pays off when there is a reasonable amount of work; otherwise the parallel overhead outweighs the gain. That is definitely the case in your example, since you create the output array on the stack (so it cannot be very large). Comments on your code:

Both loops you mention in your question depend on the order of execution, so they cannot be parallelized easily or efficiently. Note also that there is a race condition when the count array is accessed.
If you select a base which is a power of 2 (2^k), you can get rid of the expensive integer division and use fast bitwise/shift operators instead.
Always define your variables in their minimal required scope. So instead of

unsigned i;
#pragma omp parallel for reduction(max:mx) private(i)
for (i = 1; i < n; i++) ....

the following code is preferred:

#pragma omp parallel for reduction(max:mx)
for (unsigned i = 1; i < n; i++) ....

To copy your array, memcpy can be used: memcpy(arr,output,n*sizeof(output[0]))
In this loop

#pragma omp parallel for private(i)
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        count[(arr[i] / exp) % 10]++; }

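you can use a reduction instead of the atomic operation (array-section reductions need OpenMP 4.5 or later; note the section syntax count[:10], which covers all 10 elements):

#pragma omp parallel for private(i) reduction(+:count[:10])
    for (i = 0; i < n; i++) {
        count[(arr[i] / exp) % 10]++; }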
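
Putting the comments above together, here is a minimal sketch of how the sort might look. It is not the asker's code: the names count_sort_pass and radix_sort_unsigned are made up for illustration, base 256 is just one possible power of 2, and the scratch buffer is assumed to come from the heap instead of the stack. Only the histogram is parallelized; the prefix sum and the scatter stay sequential because they depend on execution order.

```c
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define RADIX_BITS 8                 /* assumption: base 2^8 = 256 */
#define RADIX (1u << RADIX_BITS)

/* One stable counting-sort pass on the digit selected by `shift`.
   `tmp` is caller-provided scratch space holding n elements. */
static void count_sort_pass(size_t n, unsigned *arr, unsigned *tmp, unsigned shift)
{
    size_t count[RADIX] = { 0 };

    /* Histogram: every thread fills a private copy of count[],
       and the copies are summed when the loop ends. */
    #pragma omp parallel for reduction(+:count[:RADIX])
    for (size_t i = 0; i < n; i++)
        count[(arr[i] >> shift) & (RADIX - 1)]++;

    /* Exclusive prefix sum: count[d] becomes the first index of digit d.
       Only 256 iterations, so it is left sequential. */
    size_t sum = 0;
    for (unsigned d = 0; d < RADIX; d++) {
        size_t c = count[d];
        count[d] = sum;
        sum += c;
    }

    /* Scatter: order-dependent, left sequential so the pass stays stable. */
    for (size_t i = 0; i < n; i++)
        tmp[count[(arr[i] >> shift) & (RADIX - 1)]++] = arr[i];

    memcpy(arr, tmp, n * sizeof arr[0]);
}

/* Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int radix_sort_unsigned(size_t n, unsigned *arr)
{
    if (n < 2)
        return 0;
    unsigned *tmp = malloc(n * sizeof *tmp);
    if (!tmp)
        return -1;
    for (unsigned shift = 0; shift < sizeof(unsigned) * CHAR_BIT; shift += RADIX_BITS)
        count_sort_pass(n, arr, tmp, shift);
    free(tmp);
    return 0;
}
```

A call would look like radix_sort_unsigned(n, arr). Unlike the code in the question, this sketch always runs one pass per byte instead of stopping at the largest value, which keeps it short at the cost of some extra passes on small inputs.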

Answer 2

Radix sort can be parallelized if you split up the data. One way to do this is to use a most significant digit radix sort for the first pass, to create multiple logical bins. For example, if using base 256 (2^8), you end up with 256 bins, which radix sort can then sort in parallel, based on the number of logical cores on your system. With 4 cores, you can sort 4 bins at a time. This relies on a somewhat uniform distribution of the most significant digit, so that the bins are roughly equal in size.
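
A rough sketch of that idea follows, assuming 32-bit keys (uint32_t instead of the question's unsigned) so that the top byte selects one of 256 bins. The function names lsd_sort_low24 and msd_then_parallel_sort are hypothetical. The first pass is done sequentially here; each bin is then sorted on its remaining 24 bits by whichever thread picks it up.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BINS 256u

/* Stable LSD radix sort on the low 24 bits of a[0..n-1]; tmp holds n elements. */
static void lsd_sort_low24(size_t n, uint32_t *a, uint32_t *tmp)
{
    for (unsigned shift = 0; shift < 24; shift += 8) {
        size_t count[BINS] = { 0 };
        for (size_t i = 0; i < n; i++)
            count[(a[i] >> shift) & 0xFFu]++;
        size_t sum = 0;
        for (unsigned d = 0; d < BINS; d++) {   /* exclusive prefix sum */
            size_t c = count[d];
            count[d] = sum;
            sum += c;
        }
        for (size_t i = 0; i < n; i++)
            tmp[count[(a[i] >> shift) & 0xFFu]++] = a[i];
        memcpy(a, tmp, n * sizeof a[0]);
    }
}

/* Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int msd_then_parallel_sort(size_t n, uint32_t *arr)
{
    if (n < 2)
        return 0;
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp)
        return -1;

    /* First pass (sequential): split the data by its most significant byte.
       After the prefix sum, start[b] is the first index of bin b
       and start[BINS] equals n. */
    size_t start[BINS + 1] = { 0 };
    for (size_t i = 0; i < n; i++)
        start[(arr[i] >> 24) + 1]++;
    for (unsigned b = 0; b < BINS; b++)
        start[b + 1] += start[b];

    size_t next[BINS];
    memcpy(next, start, sizeof next);
    for (size_t i = 0; i < n; i++)
        tmp[next[arr[i] >> 24]++] = arr[i];
    memcpy(arr, tmp, n * sizeof arr[0]);

    /* Bins occupy disjoint ranges of arr and tmp, so threads never touch
       the same memory. Dynamic scheduling helps when bin sizes differ. */
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < (int)BINS; b++) {
        size_t lo = start[b];
        size_t len = start[b + 1] - lo;
        if (len > 1)
            lsd_sort_low24(len, arr + lo, tmp + lo);
    }

    free(tmp);
    return 0;
}
```

If most keys share the same top byte, nearly all the work lands in one bin and a single thread ends up doing it, which is the uniformity caveat mentioned above.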

Trying to optimize the first pass may not help much, since you'll need an atomic read/write to update a bin index, and the random-access writes to anywhere in the destination array will create cache conflicts.
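
To make the atomic update concrete, here is a hedged sketch of what a parallel first-pass scatter could look like. The function name msd_scatter_atomic and the 32-bit key type are assumptions for illustration. Every element has to claim its slot through an atomic capture on the shared bin index, and neighbouring iterations write to unrelated places in dst, which are the two costs described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Scatter src[0..n-1] into dst, grouped by most significant byte.
   On entry, next[b] must hold the first free index of bin b
   (an exclusive prefix sum of the bin counts). */
void msd_scatter_atomic(size_t n, const uint32_t *src, uint32_t *dst, size_t next[256])
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {
        unsigned b = src[i] >> 24;
        size_t pos;

        /* Read the slot index and advance it as one indivisible step,
           so no two threads receive the same position. */
        #pragma omp atomic capture
        pos = next[b]++;

        dst[pos] = src[i];
    }
}
```

With more than one thread, the order of elements inside a bin is no longer deterministic. That is acceptable for an MSD split where each bin is sorted afterwards, but it would break the stable passes of an LSD radix sort like the one in the question.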

2 answers

Answer 1
1 2021-11-01 09:14:01

Answer 2
0 2021-11-01 16:10:44
