
Should I provide consistency checks in the Huffman tree building algorithm for DEFLATE?

RFC 1951 gives a simple algorithm that reconstructs the Huffman tree from a list of code lengths, described as follows:

     1)  Count the number of codes for each code length.  Let
         bl_count[N] be the number of codes of length N, N >= 1.

     2)  Find the numerical value of the smallest code for each
         code length:

            code = 0;
            bl_count[0] = 0;
            for (bits = 1; bits <= MAX_BITS; bits++) {
                code = (code + bl_count[bits-1]) << 1;
                next_code[bits] = code;
            }

     3)  Assign numerical values to all codes, using consecutive
         values for all codes of the same length with the base
         values determined at step 2. Codes that are never used
         (which have a bit length of zero) must not be assigned a
         value.

            for (n = 0;  n <= max_code; n++) {
                len = tree[n].Len;
                if (len != 0) {
                    tree[n].Code = next_code[len];
                    next_code[len]++;
                }
            }

But the algorithm contains no data consistency checks, even though the list of lengths can clearly be invalid. The individual length values cannot be out of range, since they are encoded in 4 bits, but there can be, for example, more codes of some length than that length can encode.

What is the minimal set of checks that will validate the data? Or are such checks unnecessary for some reason I have missed?

I think it is enough to check that next_code[len] never overflows past its len bits. So after tree[n].Code = next_code[len]; you can do the following check:

if ((tree[n].Code >> len) != 0)
    print(Error)

If tree[n].Code >> len becomes non-zero, the assigned code no longer fits in len bits, which means there are more codes of length len than there should be, so the lengths list had an error in it. On the other hand, if every symbol of the tree is assigned a valid (unique) code, then you have created a correct Huffman tree.

EDIT: It just dawned on me: you can simply make the same check at the end of step 1. You just have to check that bl_count[N] <= 2^N - SUM(2^j * bl_count[N-j]) over 1 <= j <= N, for every N >= 1 (recall that bl_count[0] = 0). The reasoning: if a binary tree has bl_count[N-1] leaves on level N-1, then it cannot have more than 2^N - 2*bl_count[N-1] nodes on level N, level 0 being the root; applying this recursively over all shorter lengths gives the sum above.
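A minimal sketch of that check (assuming C, with bl_count[] as filled in step 1; lengths_fit is a hypothetical helper name). Tracking the number of still-available codes per level avoids computing the sum explicitly:

```c
#include <stdbool.h>

#define MAX_BITS 15   /* DEFLATE code lengths are at most 15 bits */

/* Check, directly from the per-length counts of step 1, that no code
   length is over-subscribed: level N of the binary tree can hold at
   most twice the nodes left unused as leaves on level N-1. */
static bool lengths_fit(const int bl_count[MAX_BITS + 1])
{
    int available = 2;                 /* two possible 1-bit codes */
    for (int n = 1; n <= MAX_BITS; n++) {
        if (bl_count[n] > available)
            return false;              /* more codes than bit patterns */
        available = (available - bl_count[n]) * 2;
    }
    return true;
}
```

For the RFC 1951 example counts (one 2-bit, five 3-bit, two 4-bit codes) this returns true; for, say, three codes of length 1 it fails immediately.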

This guarantees that the code you create is a prefix code, but it does not guarantee that it is the one the original creator intended. If, for example, the lengths list is invalid in a way that still lets you create a valid prefix code, you cannot prove that it is the Huffman code, simply because you do not know the frequency of occurrence of each symbol.

zlib checks that the list of code lengths is both complete, i.e. that it uses up all bit patterns, and that it does not overflow the bit patterns. The one allowed exception is a single symbol with length 1, in which case the code is allowed to be incomplete (the bit 0 means that symbol, a 1 bit is undefined).

This helps zlib reject random, corrupted, or improperly coded data with higher probability and earlier in the stream. This is a different sort of robustness than what was suggested in another answer here, where you could alternatively permit incomplete codes and only return an error when an undefined code is encountered in the compressed data.

To calculate completeness, you start with the number of bits in the code, k = 1, and the number of possible codes, n = 2. There are two possible one-bit codes. You subtract from n the number of length-1 codes, n -= a[k]. Then you increment k to look at two-bit codes, and you double n. Subtract the number of two-bit codes. When you're done, n should be zero. If at any point n goes negative, you can stop right there, as you have an invalid set of code lengths. If when you're done n is greater than zero, then you have an incomplete code.
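A sketch of that walk (assuming C; count[] holds the per-length counts from step 1 of the RFC algorithm, and check_lengths is a hypothetical name):

```c
/* Walk the code lengths as described above.  Returns -1 if the lengths
   are over-subscribed (invalid), 0 if the code is complete, and a
   positive number of unused bit patterns if it is incomplete. */
static int check_lengths(const int count[], int max_bits)
{
    int n = 2;                     /* two possible 1-bit codes */
    for (int k = 1; k <= max_bits; k++) {
        n -= count[k];             /* use up patterns of length k */
        if (n < 0)
            return -1;             /* over-subscribed: stop right there */
        if (k < max_bits)
            n <<= 1;               /* each unused pattern splits in two */
    }
    return n;                      /* 0 = complete, > 0 = incomplete */
}
```

With this, zlib's allowed exception is the case where check_lengths returns 1 and the only code present has length 1.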

You need to make sure that there is no input that will cause your code to execute illegal or undefined behavior, such as indexing off the end of an array, because such illegal inputs might be used to attack your code.

In my opinion, you should attempt to handle illegal but not dangerous inputs as gracefully as possible, so as to interoperate with programs written by others which may interpret the specification differently than you have, or which have made small errors with only one plausible interpretation. This is the Robustness Principle; you can find discussions of it starting at http://en.wikipedia.org/wiki/Robustness_principle .
