
Why do we need unsigned char for Huffman tree code

I am trying to create a Huffman tree, and the question I was given seems very strange to me. It is as follows:

Given the following data structure:

struct huffman {
    unsigned char sym;              /* symbol */
    struct huffman *left, *right;   /* left and right subtrees */
};

write a program that takes the name of a binary file as sole argument, builds the Huffman tree of that file assuming that atoms (elementary symbols) are 8-bit unsigned characters, and prints the tree as well as the dictionary.
allocations must be done using nothing else than malloc(), and sorting can be done using qsort().

What confuses me is that, to write a program that creates a Huffman tree, we only need to do the following:

  1. Take a frequency array (that could be Farray[]={.......}).
  2. Sort it and repeatedly merge the two smallest nodes into a subtree until only one final node (the head) is left.

Now my question is: why and where do we need that unsigned char data? (What kind of unsigned char data does this question want? I think the frequencies alone are enough to display a Huffman tree.)

If you purely want to display the shape of the tree, then yes, you just need to build it. However, for it to be of any use whatsoever you need to know what original symbol each node represents.

Imagine your input symbols are [ABCD]. An imaginary Huffman tree/dictionary might look like this:

         ( )
        /   \              A = 1
      ( )   (A)            B = 00
     /   \                 C = 010
   (B)   ( )               D = 011
        /   \
      (C)   (D)

If you don't store sym, it looks like this:

         ( )
        /   \              A = ?
      ( )   ( )            B = ?
     /   \                 C = ?
   ( )   ( )               D = ?
        /   \
      ( )   ( )

Not very useful, that, is it?
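
To make that concrete, here is a minimal sketch of how the dictionary could be read off the tree, which only works because every leaf stores its sym. The helper name collect_codes and the codes table are my own assumptions, not part of the assignment; left is taken as bit 0 and right as bit 1, matching the diagram above.

```c
#include <stdio.h>
#include <string.h>

struct huffman {
    unsigned char sym;              /* symbol (only meaningful in leaves) */
    struct huffman *left, *right;   /* left and right subtrees */
};

/* Hypothetical helper: walk the tree, recording '0' for a left step and
 * '1' for a right step. At each leaf, copy the path into codes[sym].
 * path must have room for 256 chars (255 steps at most, plus the NUL). */
static void collect_codes(const struct huffman *node, char *path, int depth,
                          char codes[256][256])
{
    if (node == NULL)
        return;
    if (node->left == NULL && node->right == NULL) {
        path[depth] = '\0';
        strcpy(codes[node->sym], path);   /* sym indexes the table directly */
        return;
    }
    path[depth] = '0';
    collect_codes(node->left, path, depth + 1, codes);
    path[depth] = '1';
    collect_codes(node->right, path, depth + 1, codes);
}

/* Hypothetical helper: print every symbol that received a code. */
static void print_dictionary(char codes[256][256])
{
    for (int s = 0; s < 256; s++)
        if (codes[s][0] != '\0')
            printf("0x%02x = %s\n", s, codes[s]);
}
```

Call it as collect_codes(root, path, 0, codes) with a zero-initialised codes table. Note that sym being an unsigned char is what makes codes[node->sym] safe: the index is always 0..255, whereas a plain char that happens to be signed would give a negative index for bytes of 128 and above. A tree consisting of a single leaf ends up with an empty code here, which a real program would have to special-case.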

Edit 2: The missing step in the plan is step 0: build the frequency array from the file (somehow I missed that you don't need to actually encode the file too). This isn't part of the actual Huffman algorithm itself and I couldn't find a decent example to link to, so here's a rough idea:

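/* Sketch of step 0; assumed to run inside main(), with <stdio.h> included. */
FILE *input = fopen("inputfile", "rb");   /* "rb": read raw bytes, no text translation */
if (input == NULL) {
    perror("inputfile");
    return 1;
}
int freq[256] = {0};                      /* one counter per possible 8-bit symbol */
int c;                                    /* int, not char, so EOF stays distinguishable */
while ((c = fgetc(input)) != EOF)
    freq[c]++;                            /* fgetc returns the byte as an unsigned char value (0..255) */
fclose(input);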

/* do Huffman algorithm  */
...

Now, that still needs improving since it neither uses malloc() nor takes a filename as an argument, but it's not my homework ;)

It's been a while since I did this, but I think the generated "dictionary" is what you need to encode data, while the "tree" is used to decode it. Of course, you can always build one from the other.

While decoding, you traverse the tree (left/right, according to successive input bits), and when you hit a terminal node (one whose left and right pointers are both null), the 'sym' in that node is the output value. You then start again from the root for the next symbol.
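
A rough sketch of that decoding loop, under a few assumptions of mine: the bits arrive as a string of '0' and '1' characters (a real decoder would unpack them from bytes), 0 means left and 1 means right, and the tree has at least two leaves. The function name decode_bits is made up for illustration.

```c
#include <stddef.h>

struct huffman {
    unsigned char sym;              /* symbol (only meaningful in leaves) */
    struct huffman *left, *right;   /* left and right subtrees */
};

/* Hypothetical decoder: follow one branch per input bit; on reaching a leaf,
 * emit its sym and restart from the root. Returns the number of symbols
 * written to out (at most out_cap). */
static size_t decode_bits(const struct huffman *root, const char *bits,
                          unsigned char *out, size_t out_cap)
{
    const struct huffman *node = root;
    size_t n = 0;

    if (root == NULL)
        return 0;
    for (; *bits != '\0'; bits++) {
        node = (*bits == '0') ? node->left : node->right;
        if (node == NULL)
            break;                  /* malformed tree or single-leaf tree */
        if (node->left == NULL && node->right == NULL) {
            if (n == out_cap)
                break;              /* output buffer is full */
            out[n++] = node->sym;   /* this is where sym is needed */
            node = root;
        }
    }
    return n;
}
```

Without sym in the leaves, this loop would know when a symbol ends but not which symbol it was.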

Usually data compression is divided into 2 big steps; given a stream of data:

  • Evaluate the probability that a given symbol will appear in the stream; in other words, measure how frequently each symbol appears in the dataset.
  • Once you have studied the occurrences and created a table that associates each symbol with a probability, encode the symbols according to their probability. To do this you create a dictionary in which each original symbol is replaced with another code that is often much smaller in size, especially for symbols that are used frequently in the dataset. The dictionary keeps track of these substitutions for both the encoding and the decoding phase. Huffman gives you an algorithm to automate this process and get a fairly good result.

In practice it's a little bit more complicated than this, because trees are involved, but the main purpose is always to build the dictionary.
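
For what it's worth, the tree-building part can be sketched roughly as below, using only malloc() for the nodes and qsort() for the ordering, as the assignment asks. This is one possible approach, not the reference solution: struct item, by_freq_desc, new_node and build_tree are names I made up, and the wrapper struct exists only because the given struct huffman has no field for a weight. The work array lives on the stack for brevity; a strict reading of the assignment might want it malloc'ed too.

```c
#include <stdlib.h>

struct huffman {
    unsigned char sym;              /* symbol (only meaningful in leaves) */
    struct huffman *left, *right;   /* left and right subtrees */
};

/* Hypothetical wrapper: pairs a subtree with its total frequency. */
struct item {
    unsigned long freq;
    struct huffman *node;
};

/* qsort comparator: highest frequency first, so the two rarest end up last. */
static int by_freq_desc(const void *a, const void *b)
{
    const struct item *x = a, *y = b;
    return (x->freq < y->freq) - (x->freq > y->freq);
}

static struct huffman *new_node(unsigned char sym,
                                struct huffman *left, struct huffman *right)
{
    struct huffman *n = malloc(sizeof *n);
    if (n != NULL) {
        n->sym = sym;
        n->left = left;
        n->right = right;
    }
    return n;
}

/* Builds a Huffman tree from 256 byte counts. Returns NULL if every count is
 * zero or an allocation fails (nodes are not freed on failure in this sketch). */
struct huffman *build_tree(const int freq[256])
{
    struct item items[256];
    size_t n = 0;

    for (int s = 0; s < 256; s++) {
        if (freq[s] <= 0)
            continue;               /* symbols that never occur get no leaf */
        items[n].freq = (unsigned long)freq[s];
        items[n].node = new_node((unsigned char)s, NULL, NULL);
        if (items[n].node == NULL)
            return NULL;
        n++;
    }
    if (n == 0)
        return NULL;
    while (n > 1) {
        /* Re-sort, then merge the two least frequent subtrees into one. */
        qsort(items, n, sizeof items[0], by_freq_desc);
        struct huffman *parent = new_node(0, items[n - 2].node, items[n - 1].node);
        if (parent == NULL)
            return NULL;
        items[n - 2].freq += items[n - 1].freq;
        items[n - 2].node = parent;
        n--;
    }
    return items[0].node;
}
```

Re-sorting on every round is wasteful compared with a priority queue, but with at most 256 symbols it hardly matters and it keeps the sketch within the qsort() constraint. The leaves carry the original byte in sym, which is what lets you derive the dictionary from the finished tree.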

There is a complete tutorial here.
