简体   繁体   中英

Radix Sort for Strings in Python

My radix sort function outputs sorted but wrong list when compared to Python's sort:

My radix sort: ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
Python's sort: ['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']

* My radix sort does not do padding
* Its mechanism is least significant bit (LSB)
* I need to utilise the length of each word

The following is my code.

def count_sort_letters(array, size, col, base):
    output   = [0] * size
    count    = [0] * base
    min_base = ord('a')

    for item in array:
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        count[letter] += 1

    for i in range(base - 1):
        count[i + 1] += count[i]

    for i in range(size - 1, -1, -1):
        item = array[i]
        correct_index = min(len(item) - 1, col)
        letter = ord(item[-(correct_index + 1)]) - min_base
        output[count[letter] - 1] = item
        count[letter] -= 1

    return output


def radix_sort_letters(array):
    size = len(array)

    max_col = len(max(array, key = len))

    for col in range(max_col):
        array = count_sort_letters(array, size, col, 26)

    return array

Can anyone find a way to solve this problem?

As I mentioned in my comments:

In your code the lines:

correct_index = min(len(item) - 1, col)
letter = ord(item[-(correct_index + 1)]) - min_base

Always uses the first letter of the word once col is greater than the word length. This causes shorter words to be sorted based upon their first letter once col is greater than the word length. For instance ['aa', 'a'] remains unchanged since on the for col loop we compare the 'a' in both words, which keeps the results unchanged.

Code Correction

Note: Attempted to minimize changes to your original code

def count_sort_letters(array, size, col, base, max_len):
  """ Helper routine for performing a count sort based upon column col """
  output   = [0] * size
  count    = [0] * (base + 1) # One addition cell to account for dummy letter
  min_base = ord('a') - 1 # subtract one too allow for dummy character

  for item in array: # generate Counts
    # get column letter if within string, else use dummy position of 0
    letter = ord(item[col]) - min_base if col < len(item) else 0
    count[letter] += 1

  for i in range(len(count)-1):   # Accumulate counts
      count[i + 1] += count[i]

  for item in reversed(array):
    # Get index of current letter of item at index col in count array
    letter = ord(item[col]) - min_base if col < len(item) else 0
    output[count[letter] - 1] = item
    count[letter] -= 1

  return output

def radix_sort_letters(array, max_col = None):
  """ Main sorting routine """
  if not max_col:
    max_col = len(max(array, key = len)) # edit to max length

  for col in range(max_col-1, -1, -1): # max_len-1, max_len-2, ...0
    array = count_sort_letters(array, len(array), col, 26, max_col)

  return array

lst = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
print(radix_sort_letters(lst))

Test

lst = ['aa', 'a', 'ab', 'abs', 'asd', 'avc', 'axy', 'abid']
print(radix_sort_letters(lst))

# Compare to Python sort
print(radix_sort_letters(lst)==sorted(lst))

Output

['a', 'aa', 'ab', 'abid', 'abs', 'asd', 'avc', 'axy']
True

Explanation

Counting Sort is a stable sort meaning:

Let's walk through an example of how the function works.

Let's sort: ['ac', 'xb', 'ab']

We walk through each character of each list in reverse order.

Iteration 0:

 Key is last character in list (ie index -1): keys are ['c','b', 'b'] (last characters of 'ac', 'xb', and 'ab' Peforming a counting sort on these keys we get ['b', 'b', 'c'] This causes the corresponding words for these keys to be placed in the order: ['xb', 'ab', 'ac'] Entries 'xb' and 'ab' have equal keys (value 'b') so they maintain their order of 'xb' followed by 'ab' of the original list (since counting sort is a stable sort)

Iteration 1:

 Key is next to last character (ie index -2): Keys are ['x', 'a', 'a'] (corresponding to list ['xb', 'ab', 'ac']) Counting Sort produces the order ['a', 'a', 'a'] which causes the corresponding words to be placed in the order ['ab', 'ac', 'xb'] and we are done.

Original Software Error --your code originally went left to right through the strings rather than right to left. We need to go right to left since we want to sort our last sort to be based upon the first character, the next to last to be based upon the 2nd character, etc.

Different Length Strings -the example above was with equal-length strings.

The previous example was simplified assuming equal length strings. Now let's try unequal lengths strings such as:

['ac', 'a', 'ab']

This immediately presents a problem since the words don't have equal lengths we can't choose a letter each time.

We can fix by padding each word with a dummy character such as '*' to get:

['ac', 'a*', 'ab']

Iteration 0: keys are last character in each word, so: ['c', '*', 'b']

 The understanding is that the dummy character is less than all other characters, so the sort order will be: ['*', 'b', 'c'] causing the related words to be sorted in the order ['a*', 'ab', 'ac']

Iteration 1: keys are next to last character in each word so: ['a', 'a', 'a']

 Since the keys are all equal counting sort won't change the order so we keep ['a*', 'ab', 'ac'] Removing the dummy character from each string (if any) we end up with: ['a', 'ab', 'ac']

The idea behind get_index is to mimic the behavior of padding strings without actual padding (ie padding is extra work). Thus, based upon the index it evaluates if the index points to the padded or unpadded portion of the string and returns an appropriate index into the counting array for counting.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM