
Merge Order in Huffman Coding with Same-Weight Trees

I am really struggling with the order in which to merge trees that have the same weight in Huffman coding. I have looked into a lot of sources, but all of them either cover only simple cases (where there are no more than two elements with the same weight) or don't cover the topic at all.



Let's say I have the following string I want to encode: ABCDEE. (Style based on this website.)
So I have:

    FREQUENCY       VALUE
    ---------       -----
         1            A
         1            B
         1            C
         1            D
         2            E

I start building the tree now with two of the smallest elements:
Question 1) Do I have to use A & B, or how do I decide which values to use? I know they have to be the smallest ones, but beyond that? E.g. A & D?
This matters for the final result (let's say I do the following:)

  2:[A&B]       2:[C&D]
    /  \          /  \
  1:A   1:B     1:C   1:D

and with that the following table:

    FREQUENCY       VALUE
    ---------       -----
         2          [A&B]
         2          [C&D]
         2            E

Question 2) Again... in which order should I merge the trees? E.g. [A&B]&E or [A&B]&[C&D]?
Because if I merge [A&B]&E first, the tree will look like this:

      4:[A&B&E]
        /   \
    2:[A&B]   2:E
    /   \
  1:A   1:B

( Question 3) And how do I decide whether 2:E should go on the left or on the right? )

And after joining [C&D] the final tree looks like this:

     6:[A&B&C&D&E]
       /       \
 2:[C&D]    4:[A&B&E]
   /  \        /   \
 1:C   1:D  2:[A&B] 2:E
             /   \
           1:A   1:B

BUT if I start with joining [A&B]&[C&D] :

     4:[A&B&C&D]
      /        \
 2:[A&B]      2:[C&D]
   /   \       /   \
  1:A   1:B  1:C  1:D

And then join E , the final tree looks like this:

     6:[A&B&C&D&E]
       /       \
     2:E      4:[A&B&C&D]
             /        \
        2:[A&B]      2:[C&D]
          /   \       /   \
         1:A   1:B  1:C  1:D

So in the first variant E would be 11 and in the second variant 0. Or, as another example, C would be 00 vs 110...

I think there must be an elementary rule I'm missing here, because Huffman Coding has to be deterministic (to decode it properly), doesn't it!?

When you have more than one choice for the two lowest weights, it does not matter which pair you pick. For all choices the Huffman algorithm will return a set of codes that minimizes the total number of bits to code the provided set.

As a result the Huffman algorithm is not deterministic, unless other constraints are placed on the choices. Even though the algorithm can provide different results, this does not prevent an encoder / decoder combination from being deterministic. All that is required is that the resulting Huffman code be properly transmitted along with the coded data, so that the decoder can decode it. The only thing that the non-determinism shows is that the set of frequencies for the symbols is not a sufficient descriptor of a Huffman code.

As noted in another answer, the multitude of possible codes is reduced by requiring that the code be canonical. This reduces the number of bits required to transmit the Huffman code, since you no longer have to discriminate over all possible resulting codes. For a canonical code, you do not have to describe the Huffman tree or the specific bit values of the codes. That is all derivable from simply the number of bits required to code each symbol. So a sufficient descriptor of a Huffman code (or actually any prefix code) is the number of bits for each symbol.
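To sketch that last point: a canonical code can be rebuilt from the bit lengths alone. The function name and example lengths below are hypothetical, but the assignment rule (sort by length then symbol, increment the code, left-shift when the length grows) is the standard canonical construction, e.g. as used in DEFLATE:

```python
def canonical_codes(lengths):
    # Sort symbols by (code length, symbol); assign consecutive code values,
    # left-shifting the running value whenever the length increases.
    code = 0
    prev_len = 0
    codes = {}
    for sym, n in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (n - prev_len)
        codes[sym] = format(code, "0{}b".format(n))
        code += 1
        prev_len = n
    return codes

# Hypothetical length set (one of the two outcomes discussed below)
print(canonical_codes({"A": 1, "B": 2, "C": 3, "D": 3}))
# {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```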

It is important to note that even if you constrain the result to a canonical code, the Huffman algorithm can still result in a different set of code lengths for the same set of frequencies, depending on the choices made when picking the two lowest weight subtrees. Here is an example:

Consider the symbols and frequencies A: 2, B: 2, C: 1, D: 1. You necessarily combine C and D first to get a weight of 2. Now you have three weights of 2, and so three choices to combine: A and B, A and CD, or B and CD. The first choice is fundamentally different from the last two. If you combine A and B, the resulting code lengths in bits are: A = 2, B = 2, C = 2, and D = 2. If on the other hand you combine B and CD, then you end up with A = 1, B = 2, C = 3, and D = 3. Two different canonical codes for the same set of frequencies!

You might then ask, which one is "right"? The answer is that they both are. The reason is that if you multiply the frequencies times the lengths, you get the same total number of bits to code the set. 2x2 + 2x2 + 1x2 + 1x2 = 2x1 + 2x2 + 1x3 + 1x3 = 12.
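The arithmetic checks out directly (the two length sets are the ones from the example above):

```python
freq = {"A": 2, "B": 2, "C": 1, "D": 1}
lengths1 = {"A": 2, "B": 2, "C": 2, "D": 2}  # from combining A and B
lengths2 = {"A": 1, "B": 2, "C": 3, "D": 3}  # from combining B and CD

def total_bits(lengths):
    # Total coded size = sum over symbols of frequency * code length
    return sum(freq[s] * lengths[s] for s in freq)

print(total_bits(lengths1), total_bits(lengths2))  # 12 12
```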

So don't be surprised if even when you constrain the code to be canonical, that you can get more than one answer from the Huffman algorithm.

The merge order is really not important. What is important in this algorithm is to pick the two smallest subtrees every time. By greedily doing so, the letters with the highest frequency always end up closest to the root, i.e. with the shortest possible paths in the tree.
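That greedy rule can be sketched with Python's heapq (an assumed minimal implementation, not from any library; ties between equal weights are broken however the heap happens to order them, which is exactly why different but equally optimal codes can come out):

```python
import heapq
from collections import Counter

def huffman_code(text):
    # Minimal Huffman sketch: always pop the two lightest subtrees.
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique id to avoid comparing dicts, codes-so-far)
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Prefix 0 onto one subtree's codes, 1 onto the other's
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_code("ABCDEE")
print(codes)  # exact codes depend on tie-breaking; total length does not
```

Whichever ties the heap breaks, the total coded length of ABCDEE comes out the same (14 bits), matching the two variants in the question.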
