How to improve on this implementation of the radix-sort?

I'm implementing a 2-byte radix sort. The idea is to use counting sort on the lower 16 bits of the integers, then on the upper 16 bits, so the sort finishes in 2 passes. The first thing I had to figure out was how to handle negatives. Since the sign bit is set for negative numbers, their bit patterns, read as unsigned values, compare greater than those of the positives. To combat this I flipped the sign bit when the number was positive, so that [0, 2 bil) maps to the byte patterns [128 000 000 000, 255 255 ...). And when it was negative I flipped all the bits, so that it ranges over (000 000 ..., 127 255 ...). This site helped me with that information. To finish it off, I split the integer into either the top or bottom 16 bits, depending on the pass. The following is the code that lets me do that.

static uint32_t position(int number, int pass) {
    int mask;
    if (number <= 0) mask = 0x80000000;
    else mask = (number >> 31) | 0x80000000;
    uint32_t out = number ^ mask;
    return pass == 0 ? out & 0xffff : (out >> 16) & 0xffff;
}
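
As a side note on the snippet above: for a positive `number`, `number >> 31` is 0, so both branches produce the mask `0x80000000` and only the sign bit is ever flipped. For two's-complement integers that is already enough to make unsigned order match signed order; flipping all the bits of negative values is what IEEE-754 floats would need. A minimal sketch of the integer case, assuming 32-bit two's-complement values (`int_key` and `digit16` are hypothetical names, not part of the original code):

```c
#include <assert.h>
#include <stdint.h>

/* Map a signed 32-bit value to an unsigned key whose unsigned order
 * matches the signed order. Flipping only the sign bit is enough for
 * two's-complement integers. */
static uint32_t int_key(int32_t number) {
    return (uint32_t)number ^ 0x80000000u;
}

/* Extract the 16-bit digit for pass 0 (low half) or pass 1 (high half). */
static uint32_t digit16(int32_t number, int pass) {
    assert(pass == 0 || pass == 1);
    uint32_t key = int_key(number);
    return pass == 0 ? (key & 0xffffu) : (key >> 16);
}
```

With this mapping `INT32_MIN` becomes key 0, 0 becomes `0x80000000`, and `INT32_MAX` becomes `0xffffffff`, so sorting by key yields ascending signed order.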

To start the actual radix sort, I needed to build a histogram of 65536 elements. The problem I ran into was that with a very large number of input elements, building the histogram took a while, so I implemented it in parallel, using processes and shared memory. I partitioned the array into 8 subsections of size / 8 elements each. Then, over a shared-memory array of 65536 * 8 counters, I had each process build its own histogram. Afterwards, I summed them all together to form a single histogram. The following is the code for that:

for (i=0;i<8;i++) {
    pid_t pid = fork();
    if (pid < 0) _exit(0);
    if (pid == 0) {
        const int start = (i * size) >> 3;
        const int stop  = i == 7 ? size : ((i + 1) * size) >> 3;
        const int curr  = i << 16;
        for (j=start;j<stop;++j)
            hist[curr + position(array[j], pass)]++;
        _exit(0);
    }
}
for (i=0;i<8;i++) wait(NULL);

for (i=1;i<8;i++) {
    const int pos = i << 16;
    for (j=0;j<65536;j++)
        hist[j] += hist[pos + j];
}
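
For the children's counts to be visible in the parent after `fork()`, `hist` has to live in memory that is actually shared rather than copied on write. That setup is not shown in the snippet above, so the following is only a sketch of one way it could be done, assuming a POSIX system where `mmap` supports `MAP_SHARED | MAP_ANONYMOUS` (Linux, the BSDs). The names `alloc_shared_hist` and `parallel_histogram` are hypothetical, and for brevity the digit is just the low 16 bits of an already-transformed key instead of a call to `position`.

```c
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum { WORKERS = 8, BUCKETS = 65536 };

/* Allocate WORKERS * BUCKETS zero-filled counters that stay shared
 * across fork(). Returns NULL on failure. */
static int *alloc_shared_hist(void) {
    void *p = mmap(NULL, sizeof(int) * WORKERS * BUCKETS,
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (int *)p;
}

/* Count the low 16 bits of every key using WORKERS child processes.
 * Each child writes only to its own BUCKETS-sized slice, so no locking
 * is needed. On success the totals are in hist[0..BUCKETS) and 0 is
 * returned; -1 means a fork() failed. */
static int parallel_histogram(const uint32_t *keys, size_t size, int *hist) {
    assert(hist != NULL);
    int started = 0;
    for (int i = 0; i < WORKERS; i++) {
        pid_t pid = fork();
        if (pid < 0) break;                      /* stop spawning, reap below */
        if (pid == 0) {
            size_t start = size * (size_t)i / WORKERS;
            size_t stop  = size * (size_t)(i + 1) / WORKERS;
            int *mine = hist + (size_t)i * BUCKETS;
            for (size_t j = start; j < stop; j++)
                mine[keys[j] & 0xffffu]++;
            _exit(0);
        }
        started++;
    }
    for (int i = 0; i < started; i++) wait(NULL);
    if (started != WORKERS) return -1;

    /* Fold the per-process histograms into the first slice. */
    for (int i = 1; i < WORKERS; i++)
        for (int j = 0; j < BUCKETS; j++)
            hist[j] += hist[(size_t)i * BUCKETS + j];
    return 0;
}
```

Doing the index arithmetic in `size_t` also sidesteps the overflow that `i * size` in `int` would hit once `size` exceeds roughly 300 million elements.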

The next part was where I spent most of my time, analyzing how cache affected the performance of the prefix sum. With an 8-bit or 11-bit pass radix sort, the whole histogram fits within L1 cache. With 16 bits, it only fits within L2 cache. In the end the 16-bit histogram ran the sum the fastest, since I only had to run 2 iterations with it. I also ran the prefix sum in parallel as per the CUDA website recommendations. At 250 million elements, this ran about 1.5 seconds slower than the 16-bit integer version. So my prefix sum ended up looking like this:

for (i=1;i<65536;i++)
    hist[i] += hist[i-1];
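
To make the effect of this loop concrete, here is a small sketch on a 4-bucket histogram. The helper name `inclusive_prefix_sum` is hypothetical; the loop body is the same as above.

```c
#include <assert.h>
#include <stddef.h>

/* Turn per-digit counts into inclusive running totals, in place.
 * Afterwards hist[d] is one past the last output index for digit d. */
static void inclusive_prefix_sum(int *hist, size_t buckets) {
    assert(buckets > 0);
    for (size_t i = 1; i < buckets; i++)
        hist[i] += hist[i - 1];
}
```

With counts {2, 0, 3, 1} the result is {2, 2, 5, 6}: digit 0 owns output slots 0 to 1, digit 2 owns slots 2 to 4, and digit 3 owns slot 5. Because each entry points one past the end of its bucket, a scatter loop can fill every bucket from the back by pre-decrementing the entry.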

The only thing left was to traverse backwards through the array and put each element into its spot in the temp array. Since I only needed two passes, I avoided copying from temp back to array and running the code again: I ran the first pass using array as the input and temp as the output, then ran the second pass using temp as the input and array as the output. This kept me from mem-copying back to array both times. The code for the actual sort looks like this:

histogram(array, size, 0, hist);
for (i=size-1;i>=0;i--)
    temp[--hist[position(array[i], 0)]] = array[i];

memset(hist, 0, arrSize);
histogram(temp, size, 1, hist);
for (i=size-1;i>=0;i--)
    array[--hist[position(temp[i], 1)]] = temp[i];
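
Putting the pieces together, the following is a single-process sketch of the two-pass scheme described above: histogram, prefix sum, then a backward scatter that ping-pongs between `array` and `temp`. It assumes 32-bit two's-complement integers and leaves out the parallel histogram; `digit_of`, `radix_pass`, and `radix_sort_i32` are hypothetical names rather than the functions from the linked code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { RADIX = 65536 };

/* 16-bit digit of the sign-flipped key: pass 0 = low half, pass 1 = high half. */
static uint32_t digit_of(int32_t v, int pass) {
    assert(pass == 0 || pass == 1);
    uint32_t key = (uint32_t)v ^ 0x80000000u;
    return pass == 0 ? (key & 0xffffu) : (key >> 16);
}

/* One stable counting-sort pass from src into dst on the given digit. */
static void radix_pass(const int32_t *src, int32_t *dst, size_t size,
                       int pass, size_t *hist) {
    memset(hist, 0, RADIX * sizeof *hist);
    for (size_t i = 0; i < size; i++)          /* histogram */
        hist[digit_of(src[i], pass)]++;
    for (size_t d = 1; d < RADIX; d++)         /* inclusive prefix sum */
        hist[d] += hist[d - 1];
    for (size_t i = size; i-- > 0; )           /* backward scatter keeps ties in order */
        dst[--hist[digit_of(src[i], pass)]] = src[i];
}

/* Sorts array in ascending order. Returns 0 on success, -1 if allocation fails. */
static int radix_sort_i32(int32_t *array, size_t size) {
    int32_t *temp = malloc(size * sizeof *temp);
    size_t *hist = malloc(RADIX * sizeof *hist);
    if ((size > 0 && temp == NULL) || hist == NULL) {
        free(temp);
        free(hist);
        return -1;
    }
    radix_pass(array, temp, size, 0, hist);    /* low 16 bits: array -> temp */
    radix_pass(temp, array, size, 1, hist);    /* high 16 bits: temp -> array */
    free(temp);
    free(hist);
    return 0;
}
```

Because the second pass writes back into `array`, the sorted result ends up in the caller's buffer without an extra copy, which is the point of swapping the input and output roles between passes.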

This link contains the full code that I have so far. I ran a test against quicksort, and it ran between 5 and 10 times faster with integers and floats, and about 5 times faster with 8-byte data types. Is there a way to improve on this?

My guess would be that treating the sign of the integers during operation is not worth it. It complicates and slows down your code. I'd go for a first sort as unsigned and then do a second pass that just reorders the two halves and inverts the negative one.
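
One possible reading of that second pass, sketched for 32-bit two's-complement integers: after an unsigned sort the non-negative values come first and the negative values follow, already in ascending order among themselves, so moving the negative tail to the front is enough. (Reversing the negative half would additionally be needed for floats sorted by bit pattern.) The names `cmp_as_unsigned` and `move_negatives_first` are hypothetical, and `qsort` with an unsigned comparator merely stands in for the unsigned radix sort.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Comparator that orders int32_t values by their unsigned bit pattern;
 * used here only as a stand-in for an unsigned radix sort. */
static int cmp_as_unsigned(const void *a, const void *b) {
    uint32_t x = (uint32_t)*(const int32_t *)a;
    uint32_t y = (uint32_t)*(const int32_t *)b;
    return (x > y) - (x < y);
}

/* Fix-up pass: array is sorted by unsigned bit pattern, so the negative
 * values form the tail. Move that tail to the front.
 * Returns 0 on success, -1 if allocation fails. */
static int move_negatives_first(int32_t *array, size_t size) {
    assert(array != NULL || size == 0);
    size_t first_neg = 0;
    while (first_neg < size && array[first_neg] >= 0)
        first_neg++;
    size_t neg_count = size - first_neg;
    if (neg_count == 0 || first_neg == 0)
        return 0;                              /* nothing to reorder */

    int32_t *neg = malloc(neg_count * sizeof *neg);
    if (neg == NULL)
        return -1;
    memcpy(neg, array + first_neg, neg_count * sizeof *neg);
    memmove(array + neg_count, array, first_neg * sizeof *array);
    memcpy(array, neg, neg_count * sizeof *neg);
    free(neg);
    return 0;
}
```

The linear scan for the first negative value could be replaced by a binary search, since the array is already partitioned at that point.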

Also, from your code I don't get how you have the different processes operate together. How do you collect the histogram in the parent? Do you have a process-shared variable? In any case, using pthread would be much more appropriate here.
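
As a rough illustration of the pthread suggestion, the sketch below has each thread fill a private slice of one ordinary array, which the caller then sums. Threads share the address space, so no special shared-memory setup is needed. The names `hist_job`, `hist_worker`, and `threaded_histogram` are hypothetical, and the digit is simply the low 16 bits of an already-transformed key.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { THREADS = 8, BUCKETS = 65536 };

struct hist_job {
    const uint32_t *keys;
    size_t start, stop;   /* half-open range [start, stop) of keys */
    size_t *hist;         /* this thread's private BUCKETS-sized slice */
};

static void *hist_worker(void *arg) {
    struct hist_job *job = arg;
    for (size_t j = job->start; j < job->stop; j++)
        job->hist[job->keys[j] & 0xffffu]++;
    return NULL;
}

/* hist must have room for THREADS * BUCKETS counters. On success the
 * totals are in hist[0..BUCKETS) and 0 is returned; -1 means a thread
 * could not be created. */
static int threaded_histogram(const uint32_t *keys, size_t size, size_t *hist) {
    assert(hist != NULL);
    pthread_t tid[THREADS];
    struct hist_job job[THREADS];
    int started = 0;

    memset(hist, 0, sizeof *hist * THREADS * BUCKETS);
    for (int i = 0; i < THREADS; i++) {
        job[i].keys  = keys;
        job[i].start = size * (size_t)i / THREADS;
        job[i].stop  = size * (size_t)(i + 1) / THREADS;
        job[i].hist  = hist + (size_t)i * BUCKETS;
        if (pthread_create(&tid[i], NULL, hist_worker, &job[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    if (started != THREADS)
        return -1;

    /* Fold the per-thread histograms into the first slice. */
    for (int i = 1; i < THREADS; i++)
        for (size_t j = 0; j < BUCKETS; j++)
            hist[j] += hist[(size_t)i * BUCKETS + j];
    return 0;
}
```

On older toolchains this needs to be compiled and linked with `-pthread`.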
