
Parallel version of radix sort is not behaving as expected (Java)

In my project I found that sorting performance is the bottleneck. After some googling I came up with a parallel version of radix sort (with base 256). However, it is not behaving as I expected.

First, changing the base to 2^16 doesn't cause any speedup, even though halving the number of passes should theoretically make it about twice as fast (see the sketch after the code).

Second, in my parallel version I split the array into 4 parts (the number of cores), radix sort them, and then merge the results. Again, it runs in roughly the same time as the serial version.

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RadixSortPrototype {

  public static void parallelSort(long[] arr) {
    long[] output = new long[arr.length];

    // Split the array into parts of at most MAX_PART elements,
    // aiming for one part per available core.
    int MAX_PART = 1_000_000;
    int numProc = Runtime.getRuntime().availableProcessors();
    int partL = Math.min((int) Math.ceil(arr.length / (double) numProc), MAX_PART);
    int parts = (int) Math.ceil(arr.length / (double) partL);

    Future<?>[] threads = new Future<?>[parts];
    ExecutorService worker = Executors.newFixedThreadPool(numProc);

    // One pass per byte, least significant digit first.
    for (int i = 0; i < 8; i++) {
      int[][] counts = new int[parts][256];
      int radix = i;

      // Phase 1: each part counts the occurrences of its digit values.
      for (int j = 0; j < parts; j++) {
        int part = j;
        threads[j] = worker.submit(() -> {
          for (int k = part * partL; k < (part + 1) * partL && k < arr.length;
              k++) {
            int chunk = (int) ((arr[k] >> (radix * 8)) & 255);
            counts[part][chunk]++;
          }
        });
      }
      barrier(parts, threads);

      // Phase 2: a running prefix sum across digit values and parts gives
      // each part its starting write offset for every digit.
      int base = 0;
      for (int k = 0; k <= 255; k++) {
        for (int j = 0; j < parts; j++) {
          int t = counts[j][k];
          counts[j][k] = base;
          base += t;
        }
      }

      // Phase 3: stable scatter into output using the per-part offsets.
      for (int j = 0; j < parts; j++) {
        int part = j;
        threads[j] = worker.submit(() -> {
          for (int k = part * partL;
              k < (part + 1) * partL && k < arr.length;
              k++) {

            int chunk = (int) ((arr[k] >> (radix * 8)) & 255);
            output[counts[part][chunk]] = arr[k];
            counts[part][chunk]++;
          }
        });
      }
      barrier(parts, threads);

      // Phase 4: copy output back into arr for the next pass.
      for (int j = 0; j < parts; j++) {
        int part = j;
        threads[j] = worker.submit(() -> {
          for (int k = part * partL;
              k < (part + 1) * partL && k < arr.length;
              k++) {

            arr[k] = output[k];
          }
        });
      }
      barrier(parts, threads);
    }
    worker.shutdownNow();
  }

  // Wait for all submitted tasks of the current phase to finish.
  private static void barrier(int parts, Future<?>[] threads) {
    for (int j = 0; j < parts; j++) {
      try {
        threads[j].get();
      } catch (InterruptedException | ExecutionException e) {
        e.printStackTrace();
      }
    }
  }
}
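For reference, here is what I mean by changing the base to 2^16: a minimal serial sketch (simplified from the code above, class name is mine, and it treats values as unsigned 64-bit like the original). Four passes over 16-bit digits replace eight passes over 8-bit digits, at the cost of a 65536-entry count/index array per pass.

import java.util.Random;

public class Radix16Sketch {

  // Serial LSD radix sort with base 2^16: 4 passes of 16-bit digits.
  // Each idx array is 65537 ints (~256 KB), too large for a 32 KB L1 cache.
  public static void sort(long[] arr) {
    long[] out = new long[arr.length];
    for (int pass = 0; pass < 4; pass++) {
      int shift = pass * 16;
      int[] idx = new int[65537];
      for (long v : arr) idx[(int) ((v >>> shift) & 65535) + 1]++;      // count digits
      for (int d = 0; d < 65536; d++) idx[d + 1] += idx[d];             // prefix sums
      for (long v : arr) out[idx[(int) ((v >>> shift) & 65535)]++] = v; // stable scatter
      System.arraycopy(out, 0, arr, 0, arr.length);
    }
  }

  public static void main(String[] args) {
    long[] data = new Random(42).longs(1_000_000).toArray();
    sort(data);
    for (int k = 1; k < data.length; k++)
      if (Long.compareUnsigned(data[k - 1], data[k]) > 0)
        throw new AssertionError("not sorted");
    System.out.println("sorted");
  }
}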

Any ideas why it is running so slow? What is the recommended way to tackle this optimization?

I'm really curious about the answer.

Thanks!

Update

Based on the answer, I improved the locality of the data, so now it uses all the cores. I have updated the code snippet above. Here are the results for a 2-core, 4-thread CPU:

Java Parallel: 1130 ms
Radixsort Serial: 1218 ms
Radixsort Parallel: 625 ms

The question remains open whether it can be further improved.

Using base 2^16 = 65536 ends up a bit slower because L1 cache is typically 32768 bytes per core, while the base 2^16 count/index arrays each use 65536 × 4 = 262144 bytes (2^18).

The issue with radix sort is that the reads are sequential, but the writes are as random as the data. Based on the comments, the program is sorting 10 million longs at 8 bytes each, so 80 MB of data, and assuming an 8 MB L3 cache, most of those writes are going to be cache misses. The parallel operations aren't helping much because most of the writes are competing for the same 80 MB of non-cached main memory.

To avoid this issue, I used an alternate implementation where the first pass does a most-significant-digit radix sort to produce 256 bins (each bin contains integers with the same most significant byte). Then each bin is sorted using conventional least-significant-digit-first radix sort. For reasonably uniform pseudo-random data, the 256 bins end up nearly equal in size, so the 80 MB is split into 256 bins of about 312500 bytes each. For 4 threads there are 8 active bins at a time (4 for reads, 4 for writes), which, plus the count/index arrays, all fit into the 8 MB 16-way associative L3 cache common to all 4 cores.
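Since the question is in Java, here is a minimal Java sketch of that hybrid scheme (my illustration, not the C++ code used for the timings below): one counting pass distributes the input into 256 bins by the top byte, then each bin is LSD-radix-sorted on the remaining 7 bytes by a pool of worker threads. Like the question's code, it treats values as unsigned 64-bit integers.

import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HybridRadixSketch {

  public static void hybridSort(long[] arr) throws Exception {
    // MSD pass: count occurrences of each most significant byte.
    int[] offsets = new int[257];
    for (long v : arr) offsets[(int) ((v >>> 56) & 255) + 1]++;
    for (int d = 0; d < 256; d++) offsets[d + 1] += offsets[d];

    // Scatter into 256 bins; bin d occupies binned[offsets[d], offsets[d+1]).
    long[] binned = new long[arr.length];
    int[] next = new int[256];
    for (int d = 0; d < 256; d++) next[d] = offsets[d];
    for (long v : arr) binned[next[(int) ((v >>> 56) & 255)]++] = v;

    // LSD-sort each bin independently; each bin is small enough to stay cached.
    ExecutorService pool = Executors.newFixedThreadPool(
        Runtime.getRuntime().availableProcessors());
    Future<?>[] tasks = new Future<?>[256];
    for (int d = 0; d < 256; d++) {
      int lo = offsets[d], hi = offsets[d + 1];
      tasks[d] = pool.submit(() -> lsdSort(binned, lo, hi));
    }
    for (Future<?> t : tasks) t.get();
    pool.shutdown();
    System.arraycopy(binned, 0, arr, 0, arr.length);
  }

  // Conventional serial LSD radix sort of a[lo, hi) on bytes 0..6;
  // byte 7 is already identical within a bin.
  private static void lsdSort(long[] a, int lo, int hi) {
    long[] tmp = new long[hi - lo];
    for (int b = 0; b < 7; b++) {
      int shift = b * 8;
      int[] idx = new int[257];
      for (int k = lo; k < hi; k++) idx[(int) ((a[k] >>> shift) & 255) + 1]++;
      for (int d = 0; d < 256; d++) idx[d + 1] += idx[d];
      for (int k = lo; k < hi; k++) tmp[idx[(int) ((a[k] >>> shift) & 255)]++] = a[k];
      System.arraycopy(tmp, 0, a, lo, hi - lo);
    }
  }

  public static void main(String[] args) throws Exception {
    long[] data = new Random(1).longs(4_000_000).toArray();
    hybridSort(data);
    for (int k = 1; k < data.length; k++)
      if (Long.compareUnsigned(data[k - 1], data[k]) > 0)
        throw new AssertionError("not sorted");
    System.out.println("sorted");
  }
}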

For larger arrays, the initial pass could split up the array into 512 to 4096 or more bins, as sketched below.
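As a rough illustration of that scaling (my own heuristic, not taken from the tested code), one could pick a power-of-two bin count so each bin stays near the ~312 KB size that worked above:

public class BinCountHeuristic {
  // Hypothetical heuristic: grow the bin count until each bin is at most
  // roughly targetBinBytes (e.g. ~320 KB, as in the 256-bin case above).
  static int pickBinCount(long n, int bytesPerElement, long targetBinBytes) {
    long totalBytes = n * bytesPerElement;
    int bins = 256;
    while (bins < 65536 && totalBytes / bins > targetBinBytes) bins *= 2;
    return bins;
  }

  public static void main(String[] args) {
    System.out.println(pickBinCount(10_000_000L, 8, 320_000L)); // 256
    System.out.println(pickBinCount(80_000_000L, 8, 320_000L)); // 2048
  }
}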

I did some testing with some old C++ code I have for radix sort, sorting pseudo-random 64-bit integers using base 2^8 = 256. I tested 3 implementations: single-threaded least-significant-digit-first, single-threaded most-significant-digit-first (the hybrid described above), and 4-threaded most-significant-digit-first. When the number of integers was a power of 2, it resulted in some cache conflicts, affecting the times in some cases.

16000000 - 8 bins + index arrays fit in 8MB L3 cache.
16777216 = 2^24, 8 bins + index arrays fit in 8MB L3 cache.
30000000 - 8 bins + index arrays fit in 8MB L3 cache.
33554432 = 2^25, 8 bins + index arrays a bit larger than 8MB.
36000000 - 8 bins + index arrays a bit larger than 8MB.

Win 7 Pro 64-bit, VS 2015, Intel 3770K 3.5 GHz (times in seconds)
count        1 thread LSD  1 thread MSD  4 thread MSD
16000000     0.59          0.38          0.16
16777216     1.35          0.48          0.30
30000000     0.82          0.70          0.30
33554432     3.20          1.09          0.68
36000000     0.95          0.82          0.39

Win 10 Pro 64-bit, VS 2019, Intel 10510U 1.8 GHz to 4.9 GHz (times in seconds)
count        1 thread LSD  1 thread MSD  4 thread MSD
16000000     0.312         0.230         0.125
16777216     0.897         0.242         0.150
30000000     0.480         0.430         0.236
33554432     2.880         0.510         0.250
36000000     0.568         0.530         0.305
