Storing and reconstructing a Huffman tree

What is the best way to dehydrate a Huffman tree? By dehydration I mean: given a Huffman tree and the characters in each leaf, how can you efficiently store the structure of the tree and later reconstruct it?

Take the tree below:

        garbage
        /     \
       A     garbage
             /     \
            B       C

One idea might be to store each symbol together with its level (depth) in the tree and then use this information to reconstruct the tree. In this case: A1B2C2. So how can I first get the levels, and then associate each level with its character?

You almost certainly do not need to store the tree itself. You could, and it wouldn't take as much space as you might think, but it's generally not necessary.

If your Huffman codes are canonical, you need only store the bit-length of each symbol, as this is all the information required to generate a canonical coding. This is a relatively small number of bits per symbol, so it should be fairly compact. You can also compress that information further (see the answer from Aki Suihkonen).

Naturally the bit-length of a code is essentially the same as the tree depth, so I think this is roughly what you're asking about. The important part is to know how to build a canonical code, given the lengths - it's not necessarily the same as the codes produced by traversing the tree. You could regenerate a tree from this, but it's not necessarily the tree you started with - however typically you don't need the tree other than to determine the code lengths in the first place.

The algorithm for generating canonical codes is fairly simple:

  1. Take all the symbols you want to generate codes for, sorted first by code-length (shortest first), and then by the symbol itself.
  2. Start with a zero-length code.
  3. If the next symbol requires more bits than are currently in the code, add zeros to the right (least significant bits) of your code until it's the right length.
  4. Associate the code with the current symbol, and increment the code.
  5. Loop back to (3) until you have generated all the symbols.
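The five steps above can be sketched in C roughly as follows. This is only an illustration, not code from any particular library: the names `CodeEntry`, `assign_canonical` and `print_code` are made up for the example, and it assumes every entry has a code length between 1 and 31 bits.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical record: one entry per symbol that actually occurs. */
typedef struct {
    unsigned char symbol;  /* the symbol value */
    unsigned length;       /* code length in bits (assumed 1..31) */
    unsigned code;         /* canonical code, filled in by assign_canonical() */
} CodeEntry;

/* Step 1: order by code length (shortest first), then by symbol value. */
static int compare_entries(const void *pa, const void *pb)
{
    const CodeEntry *a = pa;
    const CodeEntry *b = pb;
    if (a->length != b->length)
        return a->length < b->length ? -1 : 1;
    return (a->symbol > b->symbol) - (a->symbol < b->symbol);
}

/* Sorts the entries in place and assigns canonical codes (steps 2 to 5). */
void assign_canonical(CodeEntry *entries, size_t count)
{
    unsigned code = 0;     /* step 2: start with a zero-length code */
    unsigned length = 0;
    size_t i;

    qsort(entries, count, sizeof entries[0], compare_entries);
    for (i = 0; i < count; i++) {
        assert(entries[i].length >= 1 && entries[i].length <= 31);
        /* step 3: append zeros on the right up to the required length */
        code <<= entries[i].length - length;
        length = entries[i].length;
        /* step 4: hand out the code, then increment it */
        entries[i].code = code++;
    }
}

/* Prints one code as a string of '0' and '1', most significant bit first. */
void print_code(const CodeEntry *e)
{
    unsigned bit;
    printf("%c = ", e->symbol);
    for (bit = e->length; bit > 0; bit--)
        putchar(((e->code >> (bit - 1)) & 1u) ? '1' : '0');
    putchar('\n');
}
```

Calling `assign_canonical` on three entries with lengths a=1, b=2, n=2 and then `print_code` on each gives a = 0, b = 10, n = 11. Sorting first means the decoder only needs the lengths to arrive at the same assignment.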

Take the string "banana". Obviously there are 3 symbols used, 'b', 'a', and 'n', with counts of 1, 3, and 2, respectively.

So the tree might look like this:

    *
   / \
  *   a
 / \
b   n

Naively, that could give codes:

a = 1
b = 00
n = 01

However, if you instead use just the bit-lengths as input to canonical code generation, you would produce this:

a = 0
b = 10
n = 11

It's a different code, but because every symbol keeps the same code length, it produces compressed output of exactly the same size. Furthermore, you only need to store the code-lengths in order to reproduce the code.

So you only need to store a sequence:

0... 1 2 0... 2 0...

Where "..." represents easily compressible repetition, and the values are all quite small (probably only 4 bits each). Note that the symbols themselves aren't stored at all: the position of each length in the sequence identifies its symbol. This representation will be very compact.

If you really must store the tree itself, one technique is to traverse the tree and store a single bit to indicate whether a node is internal or a leaf, and then, for leaf nodes, store the symbol. This is fairly compact for trees which do not contain every symbol, and not too bad even for fairly complete trees. The worst-case size would be the total size of all your symbols, plus one bit for every node the tree could have. For a standard 8-bit byte stream, that would be 320 bytes (256 bytes for the symbols, plus 511 bits, rounded up to 64 bytes, for the tree structure itself).

The method is to start at the root node, and for each node:

  • If the node is internal (a parent), output a 0 and then output the left child followed by the right child.
  • If the node is a leaf, output a 1 and then output the symbol.

To reconstruct, perform the same recursive procedure in reverse: read one bit, and depending on its value either recursively create the two children or read in a symbol.

For the example above, the bit-stream for the tree would be something like:

0, 0, 1, 'b', 1, 'n', 1, 'a'

That's 5 bits for the tree, plus 3 bytes for the symbols: 29 bits, which rounds up to 4 bytes of storage. However, it will grow rapidly as you add more symbols, whereas storing the code-lengths does not.
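As a rough sketch of the traversal described above, the following C code writes a tree as a pre-order bit stream and reads it back. The `Node` and `BitStream` types and all function names are invented for this example; the bit stream works on a caller-supplied buffer with no bounds checking, and symbols are assumed to be 8 bits wide.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tree node: a leaf has two NULL children. */
typedef struct Node {
    struct Node *left, *right;
    unsigned char symbol;       /* meaningful for leaves only */
} Node;

/* Minimal MSB-first bit stream over a caller-supplied buffer (no bounds checks). */
typedef struct {
    unsigned char *data;
    size_t pos;                 /* current position, in bits */
} BitStream;

static void put_bits(BitStream *bs, unsigned value, unsigned count)
{
    while (count-- > 0) {
        unsigned bit = (value >> count) & 1u;
        if (bs->pos % 8 == 0)
            bs->data[bs->pos / 8] = 0;       /* clear each byte on first use */
        bs->data[bs->pos / 8] |= (unsigned char)(bit << (7 - bs->pos % 8));
        bs->pos++;
    }
}

static unsigned get_bits(BitStream *bs, unsigned count)
{
    unsigned value = 0;
    while (count-- > 0) {
        unsigned bit = (bs->data[bs->pos / 8] >> (7 - bs->pos % 8)) & 1u;
        value = (value << 1) | bit;
        bs->pos++;
    }
    return value;
}

/* Pre-order write: 0 for an internal node, 1 plus 8 symbol bits for a leaf. */
void write_tree(const Node *node, BitStream *bs)
{
    assert(node != NULL);
    if (node->left == NULL) {               /* leaf */
        put_bits(bs, 1, 1);
        put_bits(bs, node->symbol, 8);
    } else {                                /* internal: left subtree, then right */
        put_bits(bs, 0, 1);
        write_tree(node->left, bs);
        write_tree(node->right, bs);
    }
}

/* Mirror of write_tree(). Returns NULL if this node cannot be allocated;
   allocation failures further down are not handled in this sketch. */
Node *read_tree(BitStream *bs)
{
    Node *node = malloc(sizeof *node);
    if (node == NULL)
        return NULL;
    node->left = node->right = NULL;
    node->symbol = 0;
    if (get_bits(bs, 1) == 1) {
        node->symbol = (unsigned char)get_bits(bs, 8);
    } else {
        node->left = read_tree(bs);
        node->right = read_tree(bs);
    }
    return node;
}

void free_tree(Node *node)
{
    if (node != NULL) {
        free_tree(node->left);
        free_tree(node->right);
        free(node);
    }
}
```

For the three-leaf example tree, `write_tree` advances the stream by 29 bits, matching the count above. Real code would also want to guard against truncated or malformed input when reading.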

The DEFLATE specification (RFC 1951, the format used by zlib) explains that to store a Huffman tree one only needs the bit length of each symbol. E.g. if one constructs a tree with A=101, B=111, C=110, D=01, one simply records the bit lengths (3, 3, 3, 2) and regenerates the codes from them so that the codewords of each length are consecutive: D=00, A=010, B=011, C=100. That is what the following procedure yields once codes are handed out in symbol order.

Set bl_count[2]=1 and bl_count[3]=3 (all other counts zero) and iterate:

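/* From the DEFLATE specification, RFC 1951.
   bl_count[n] is the number of codes of length n;
   next_code[n] receives the smallest code of length n. */
code = 0;
bl_count[0] = 0;
for (bits = 1; bits <= MAX_BITS; bits++) {
    code = (code + bl_count[bits-1]) << 1;
    next_code[bits] = code;
}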
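The loop above is only the middle step of the procedure in RFC 1951: the lengths are counted first, and codes are handed out afterwards. A self-contained sketch of all three steps might look like this. `bl_count`, `next_code` and `MAX_BITS` follow the RFC's naming, while `assign_codes`, `lengths` and `codes` are names chosen for this example.

```c
#include <assert.h>

#define MAX_BITS 15   /* longest code length DEFLATE allows */

/*
 * lengths[i] is the code length of symbol i (0 means the symbol is unused).
 * codes[i] receives the canonical code of symbol i; entries for unused
 * symbols are left untouched.
 */
void assign_codes(const unsigned *lengths, unsigned *codes, int n)
{
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    unsigned code = 0;
    int bits, i;

    /* Step 1: count how many codes there are of each length. */
    for (i = 0; i < n; i++) {
        assert(lengths[i] <= MAX_BITS);
        bl_count[lengths[i]]++;
    }
    bl_count[0] = 0;

    /* Step 2: find the numerically smallest code of each length. */
    for (bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Step 3: hand out consecutive codes of each length in symbol order. */
    for (i = 0; i < n; i++) {
        if (lengths[i] != 0)
            codes[i] = next_code[lengths[i]]++;
    }
}
```

With lengths 3, 3, 3, 2 for A, B, C, D this gives A=010, B=011, C=100, D=00 (the values 2, 3, 4 and 0).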

As the maximum code length is 15, one needs at most 4 bits per symbol to store these lengths: 3,3,3,2 == 0011 0011 0011 0010. However, zlib/DEFLATE does better: it run-length encodes the lengths using escape symbols (16 = repeat the previous length 3 to 6 times, 17 = a run of 3 to 10 zeros, 18 = a run of 11 to 138 zeros) and then Huffman-codes the result, further compressing the stream of code lengths. The RLE also takes care of zero lengths, i.e. missing characters.
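To give a feel for the run-length step, here is a simplified sketch that turns an array of code lengths into symbols of the RFC 1951 code length alphabet (0 to 15 literal, plus the repeat codes 16, 17 and 18). The names `ClSymbol`, `emit` and `rle_lengths` are invented for this example, the run-splitting strategy is a simple greedy one rather than what zlib actually does, and the later stages (Huffman-coding these symbols and writing the extra bits) are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical output record for one code-length symbol. */
typedef struct {
    unsigned char symbol;  /* 0..15 = literal length; 16, 17, 18 = repeat codes */
    unsigned char extra;   /* value carried in the extra bits (0 if none) */
} ClSymbol;

static size_t emit(ClSymbol *out, size_t count, unsigned symbol, size_t extra)
{
    out[count].symbol = (unsigned char)symbol;
    out[count].extra = (unsigned char)extra;
    return count + 1;
}

/*
 * Run-length encodes n code lengths (each 0..15).
 * out must have room for n entries; returns the number of entries written.
 */
size_t rle_lengths(const unsigned char *lengths, size_t n, ClSymbol *out)
{
    size_t i = 0, count = 0;

    while (i < n) {
        unsigned char len = lengths[i];
        size_t run = 1;

        assert(len <= 15);
        while (i + run < n && lengths[i + run] == len)
            run++;                          /* measure the run starting at i */

        if (len == 0 && run >= 3) {
            if (run > 138)
                run = 138;                  /* symbol 18 covers at most 138 zeros */
            if (run >= 11)
                count = emit(out, count, 18, run - 11);  /* 11..138 zeros */
            else
                count = emit(out, count, 17, run - 3);   /* 3..10 zeros */
            i += run;
        } else if (len != 0 && run >= 4) {
            size_t repeats = run - 1 > 6 ? 6 : run - 1;
            count = emit(out, count, len, 0);            /* the length itself */
            count = emit(out, count, 16, repeats - 3);   /* then 3..6 copies */
            i += 1 + repeats;
        } else {
            count = emit(out, count, len, 0);            /* short run: literal */
            i++;
        }
    }
    return count;
}
```

For a 256-entry table in which only 'a' (length 1), 'b' and 'n' (length 2) are non-zero, assuming ASCII, this produces just 7 symbols: the long stretches of zeros collapse into codes 17 and 18.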
