LSD radix sort for negative integers without queue

First of all, I know there is a similar question here: Radix Sort for Negative Integers

However, this question is not a duplicate of that one.

I am studying radix sorts and have a question regarding the implementation of LSD radix sort by Prof. Sedgewick and Prof. Wayne.

public static void sort(int[] a) {
    final int BITS_PER_BYTE = 8;         // a class-level constant in the original source
    final int BITS = 32;                 // each int is 32 bits
    final int R = 1 << BITS_PER_BYTE;    // each byte is between 0 and 255
    final int MASK = R - 1;              // 0xFF
    final int w = BITS / BITS_PER_BYTE;  // each int is 4 bytes

    int n = a.length;
    int[] aux = new int[n];

    for (int d = 0; d < w; d++) {         

        // compute frequency counts
        int[] count = new int[R+1];
        for (int i = 0; i < n; i++) {           
            int c = (a[i] >> BITS_PER_BYTE*d) & MASK;
            count[c + 1]++;
        }

        // compute cumulates
        for (int r = 0; r < R; r++)
            count[r+1] += count[r];

        // for most significant byte, 0x80-0xFF comes before 0x00-0x7F
        if (d == w-1) {
            int shift1 = count[R] - count[R/2];
            int shift2 = count[R/2];
            for (int r = 0; r < R/2; r++)
                count[r] += shift1;
            for (int r = R/2; r < R; r++)
                count[r] -= shift2;
        }

        // move data
        for (int i = 0; i < n; i++) {
            int c = (a[i] >> BITS_PER_BYTE*d) & MASK;
            aux[count[c]++] = a[i];
        }

        // copy back
        for (int i = 0; i < n; i++)
            a[i] = aux[i];
    }
}

What is going on with the most significant byte? It is far more elegant than anything I came up with.

I am not confident in my ability to explain that block of code. It obviously deals with negative numbers, but I am not exactly sure how.

Could somebody explain that block of code in greater detail?

UPDATE

I think I was additionally confused by the naming of the variables shift1 and shift2. If we rename things a bit and add a comment or two:

if (d == w-1) {
    int totalNegatives = count[R] - count[R/2];
    int totalPositives = count[R/2];
    // all positive numbers must come after any negative number
    for (int r = 0; r < R/2; r++)
        count[r] += totalNegatives;
    // all negative numbers must come before any positive number
    for (int r = R/2; r < R; r++)
        count[r] -= totalPositives;
}

This becomes easier to follow.

The idea is that the first positive number can only be placed after the last negative number, and all positive numbers must come after the negative ones in sorted order. Therefore we simply add the total count of negative numbers to the start index of every positive bucket, which ensures that positive numbers (zero included) will indeed come after the negatives. The same reasoning applies to the negative numbers: subtracting the total count of positive numbers moves their buckets to the front.

Basic algorithm

Let's start by ignoring the block for the most significant byte and try to understand the rest of the code.

The algorithm handles the integers byte by byte. Every byte can have 256 different values, which are counted separately. This is what happens in the first block. After

int[] count = new int[R+1];
for (int i = 0; i < n; i++) {           
    int c = (a[i] >> BITS_PER_BYTE*d) & MASK;
    count[c + 1]++;
}

every count[i] is the number of elements in a that have value i-1 in their d-th byte (note that they use count[c + 1]++, so count[0] == 0).

The algorithm then continues to compute the cumulative counts with

for (int r = 0; r < R; r++)
    count[r+1] += count[r];

After this, every count[i] is the index where the first element of that bucket should end up in the (intermediate) output. (Note that count has length 257 (R + 1), so the last element of the cumulative array can be ignored. I'll put it in brackets in the examples below.) Let's look at an example with 4 values (instead of 256, to keep it concise):

Consider an array with byte values [0, 3, 3, 2, 1, 2]. This gives counts [0, 1, 1, 2, 2] and cumulative counts [0, 1, 2, 4, (6)]. These are exactly the indices of the first 0, 1, 2, and 3 in the sorted array (which would be [0, 1, 2, 2, 3, 3]).

Now the algorithm uses those cumulative counts as indices in the (intermediate) output. It increments the bucket index whenever it copies an element from that bucket, so elements from the same bucket are copied to consecutive spots.
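The small example above can be reproduced with a few lines of Java. This is only a sketch: the class name CumulativeCountsDemo and the helper cumulativeCounts are made up for illustration and are not part of the original code, and the radix is 4 instead of 256.

```java
import java.util.Arrays;

class CumulativeCountsDemo {
    // Hypothetical helper: returns the cumulative counts for keys in 0..R-1.
    // result[k] is the index of the first key k in sorted order,
    // and result[R] is the total number of keys.
    static int[] cumulativeCounts(int[] keys, int R) {
        int[] count = new int[R + 1];
        for (int key : keys)
            count[key + 1]++;            // frequency of key is stored one slot to the right
        for (int r = 0; r < R; r++)
            count[r + 1] += count[r];    // running sum turns frequencies into start indices
        return count;
    }

    public static void main(String[] args) {
        int[] keys = {0, 3, 3, 2, 1, 2};
        System.out.println(Arrays.toString(cumulativeCounts(keys, 4)));
        // prints [0, 1, 2, 4, 6]
    }
}
```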

for (int i = 0; i < n; i++) {
    int c = (a[i] >> BITS_PER_BYTE*d) & MASK;
    aux[count[c]++] = a[i];
}

for (int i = 0; i < n; i++)
    a[i] = aux[i];
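To see why repeating this stable pass from the least significant digit upward sorts the whole array, here is a minimal sketch that uses 2-bit digits (radix 4) on values in 0..15. The class LsdPassDemo and the method pass are hypothetical names chosen for this illustration; the loop bodies follow the code above.

```java
import java.util.Arrays;

class LsdPassDemo {
    static final int BITS_PER_DIGIT = 2;          // assumption: 2-bit digits to keep the example small
    static final int R = 1 << BITS_PER_DIGIT;     // 4 possible digit values
    static final int MASK = R - 1;                // 0b11

    // One stable counting pass on digit d (d = 0 is the least significant digit).
    static void pass(int[] a, int d) {
        int n = a.length;
        int[] count = new int[R + 1];
        for (int i = 0; i < n; i++)
            count[((a[i] >> BITS_PER_DIGIT * d) & MASK) + 1]++;
        for (int r = 0; r < R; r++)
            count[r + 1] += count[r];
        int[] aux = new int[n];
        for (int i = 0; i < n; i++)
            aux[count[(a[i] >> BITS_PER_DIGIT * d) & MASK]++] = a[i];
        System.arraycopy(aux, 0, a, 0, n);        // copy back
    }

    public static void main(String[] args) {
        int[] a = {13, 2, 7, 8, 5, 2};
        pass(a, 0);
        System.out.println(Arrays.toString(a));   // prints [8, 13, 5, 2, 2, 7]
        pass(a, 1);
        System.out.println(Arrays.toString(a));   // prints [2, 2, 5, 7, 8, 13]
    }
}
```

After the first pass the array is ordered by the low digit only; because the second pass is stable, elements with the same high digit keep that order, which yields the fully sorted result.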

Handling the sign bit

The most significant bit is special because in two's complement it is the sign bit: 1 for negative numbers and 0 for non-negative numbers. So the cumulative array count is incorrect for the final step, which processes the byte containing that bit. The counts for values whose most significant bit is 0 (the non-negative numbers) are in the first half of the array, and the counts for values whose most significant bit is 1 (the negative numbers) are in the second half. Therefore, the first half and the second half of the array must be "flipped".

The flip is achieved by adding the total number of elements in the second half of the counts array to each element in the first half, and by subtracting the total number of elements in the first half from each element in the second half. As noted earlier, the counts array has length 257, so the first 128 elements (257 / 2) are the first half and the remaining 129 elements are the second half. The code's second loop stops at index 255, so the unused last element is left unchanged.

Let's look at a new example, now with two-bit signed values, i.e., -2, -1, 0, and 1. The binary representations of these are 10, 11, 00, and 01, so mapped to unsigned numbers they are 2, 3, 0, and 1, respectively.

Consider an array a = [0, -1, -1, -2, 1, -2]. Convert to unsigned: [0, 3, 3, 2, 1, 2]. Apply the algorithm to get the cumulative counts: [0, 1, 2, 4, (6)]. If we did not do the flipping, we would end up with the sorted unsigned array [0, 1, 2, 2, 3, 3], which is equivalent to the signed array [0, 1, -2, -2, -1, -1]. That's not properly sorted.

Now, let's apply the extra step for the signed values. We split the cumulative counts array in two halves: [0, 1] and [2, 4, (6)]. There are 2 (2 - 0) elements in the first half and 4 (6 - 2) elements in the second half. So we add 4 to each element in the first half: [4, 5] and subtract 2 from each element in the second half: [0, 2, (4)]. Combining the halves gives [4, 5, 0, 2, (4)].

If we now use these counts as indices in the final unsigned array, we get [2, 2, 3, 3, 0, 1] (the first 0 is at index 4, the first 1 at index 5, and so on). Converting this back to signed values gives [-2, -2, -1, -1, 0, 1], which is indeed correct.
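A quick way to convince yourself of this is to print the bucket that a few ints land in during the last pass. The sketch below uses a hypothetical helper topByte that applies the same shift-and-mask expression as the sort with d = 3 (a shift of 24 bits).

```java
class TopByteDemo {
    // Same expression as in the sort, specialised to the most significant byte.
    static int topByte(int x) {
        return (x >> 24) & 0xFF;
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE};
        for (int x : samples)
            System.out.println(x + " -> bucket " + topByte(x));
        // Negative values land in buckets 128..255 (0x80-0xFF),
        // non-negative values in buckets 0..127 (0x00-0x7F).
    }
}
```

Within each half the bucket order already matches numeric order (Integer.MIN_VALUE maps to 128 and -1 maps to 255), so only the two halves need to trade places.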
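The two-bit walkthrough can be checked with a short program. This is a sketch under the same simplification as the example (radix 4, keys taken from the low two bits of each value); the names SignFlipDemo and sortTwoBitSigned are invented for illustration.

```java
import java.util.Arrays;

class SignFlipDemo {
    static final int R = 4;   // assumption: two-bit keys, as in the example

    // Sorts values in the range -2..1 with one counting pass plus the half flip.
    static int[] sortTwoBitSigned(int[] values) {
        int[] count = new int[R + 1];
        for (int v : values)
            count[(v & 3) + 1]++;              // v & 3 maps -2, -1, 0, 1 to 2, 3, 0, 1
        for (int r = 0; r < R; r++)
            count[r + 1] += count[r];

        int negatives = count[R] - count[R / 2];
        int nonNegatives = count[R / 2];
        for (int r = 0; r < R / 2; r++)
            count[r] += negatives;             // non-negative buckets start after all negatives
        for (int r = R / 2; r < R; r++)
            count[r] -= nonNegatives;          // negative buckets move to the front

        int[] aux = new int[values.length];
        for (int v : values)
            aux[count[v & 3]++] = v;
        return aux;
    }

    public static void main(String[] args) {
        int[] a = {0, -1, -1, -2, 1, -2};
        System.out.println(Arrays.toString(sortTwoBitSigned(a)));
        // prints [-2, -2, -1, -1, 0, 1]
    }
}
```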


Possible confusion: one of the confusing parts in this algorithm is that the counts array is used for two different purposes. First it's used to count separate occurrences and later it's used to count cumulative occurrences. When counting separately, the first element of the array is not used. When counting cumulatively, the last element of the array is not used.

I think the algorithm would have been simpler if two separate arrays were used instead.
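As a rough illustration of that idea, here is one possible variant with a freq array for the per-bucket counts and a start array for the first output index of each bucket. It is not the Sedgewick and Wayne implementation; the class name LsdTwoArrays and the way the last pass visits buckets 128..255 before 0..127 are choices made for this sketch.

```java
import java.util.Arrays;

class LsdTwoArrays {
    static void sort(int[] a) {
        final int BITS_PER_BYTE = 8;
        final int R = 1 << BITS_PER_BYTE;   // 256 buckets
        final int MASK = R - 1;             // 0xFF
        final int W = 4;                    // bytes per int
        int n = a.length;
        int[] aux = new int[n];

        for (int d = 0; d < W; d++) {
            // freq[r] = number of elements whose d-th byte equals r
            int[] freq = new int[R];
            for (int i = 0; i < n; i++)
                freq[(a[i] >> BITS_PER_BYTE * d) & MASK]++;

            // start[r] = output index of the first element of bucket r.
            // On the last pass the buckets are visited as 128..255, then 0..127,
            // so negative values are placed before non-negative ones.
            int[] start = new int[R];
            int pos = 0;
            for (int k = 0; k < R; k++) {
                int r = (d == W - 1) ? (k + R / 2) % R : k;
                start[r] = pos;
                pos += freq[r];
            }

            for (int i = 0; i < n; i++)
                aux[start[(a[i] >> BITS_PER_BYTE * d) & MASK]++] = a[i];
            System.arraycopy(aux, 0, a, 0, n);
        }
    }

    public static void main(String[] args) {
        int[] a = {170, -45, 75, -90, 0, 802, -2, Integer.MIN_VALUE, Integer.MAX_VALUE};
        sort(a);
        System.out.println(Arrays.toString(a));
    }
}
```

The extra array costs 256 ints per pass, but no slot is left unused and the sign handling becomes a change in the order in which buckets receive their start index.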
