简体   繁体   中英

Python, understanding Huffman code

I am trying to understand a Huffman code written in python by 'Rosetta code'. Following is small part from the code.

def encode(symb2freq):
    heap = [[wt, [sym, ""]] for sym, wt in symb2freq.items()] #What does this do?

I'm assuming variable heap is a list. But what is wt and sym ?

sym2freq is a dictionary with keys some symbols and their values the symbol's frequency. For example, if you have the string 'aaabacba', the dictionary will look like this

sym2freq = {'a': 5, 'b': 2, 'c': 1}

That's because we have 5 times the letter a, 2 times the letter b and 1 time the letter c.

Dictionaries have the method items() , which will return a tuple of each key with its respective value. In our case, it would return

>>> sym2freq.items()
(('a', 5), ('b', 2), ('c', 1))

The for sym, wt in symb2freq.items() part of the comprehension list is just unpacking. Every time we fetch one of our tuples from above, we assign the first object to the variable sym and the second to the variable wt . The names sym and wt were chosen purely to be reflective of the values the represent, ie symbol and weight (frequency).

>>> sym, wt = ('a', 5)
>>> print sym
'a'
>>> print wt
5

And since the list comprehension will create lists of the structure [wt, [sym, ""]] , you'll end up with the list

>>> encode(sym2freq)
[[5, ['a', '']], [2, ['b', '']], [1, ['c', '']]]

The reason why we go from a dictionary of symbol frequencies to a structure like the list heap is so we can sort out symbols in terms of their frequencies, which, as you are learning, is part of constructing the Huffman tree.

This is a list comprehension . This is saying

  • Get the items of symb2freq and start to loop over them.
  • Get the first item in symb2freq and unpack them into variables sym and wt .
  • Now add [wt, [sym, ""]] to the list
  • Do this for each item
  • Now put the list in the variable heap

For example, [bar(x) for x in foo] makes a list that applies bar(x) to each value in the list.

This is an interesting problem. Imagine a dictionary with five letters where the key is the letter and the value is the frequency:

a = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}

then mapping it so that it's a list of lists with frequency, and a list starting with the letter into the heap variable like this:

[[1, ['a', '']], [2, ['b', '']], [3, ['c', '']], [4, ['d', '']], [5, ['e', '']]]

then when you heapify it, you get:

[[1, ['a', '']],
 [3, ['c', '']],
 [2, ['b', '']],
 [5, ['e', '']],
 [4, ['d', '']]]

Just work things out step by step with a valid dictionary to encode like the one given on the wikipedia page and you'll be able to see what each step does. Or use a trivial example to start like the one I gave and slowly grow the number of elements until you're working with a real document.

What the rosetta code is doing is iterating through the heap and adding digits to the blank string you see after the letter based on frequency, until each letter is mapped to a binary interpretation. So in the end you'd having something like this:

 [['e', '101'], 
  ['d', '010'], 
  ['c', '1001'], 
  ['b', '1100'], 
  ['a', '1101']]

where the most frequent letters require the fewest bits.

sym is symbol, and wt is weight.

https://en.wikipedia.org/wiki/Huffman_coding

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM