
Should I provide consistency checks in the Huffman tree building algorithm for DEFLATE?

In RFC 1951 there is a simple algorithm that restores the Huffman tree from a list of code lengths, described as follows:

     1)  Count the number of codes for each code length.  Let
         bl_count[N] be the number of codes of length N, N >= 1.

     2)  Find the numerical value of the smallest code for each
         code length:

            code = 0;
            bl_count[0] = 0;
            for (bits = 1; bits <= MAX_BITS; bits++) {
                code = (code + bl_count[bits-1]) << 1;
                next_code[bits] = code;
            }

     3)  Assign numerical values to all codes, using consecutive
         values for all codes of the same length with the base
         values determined at step 2. Codes that are never used
         (which have a bit length of zero) must not be assigned a
         value.

            for (n = 0;  n <= max_code; n++) {
                len = tree[n].Len;
                if (len != 0) {
                    tree[n].Code = next_code[len];
                    next_code[len]++;
                }
            }
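For reference, the three steps can be put together as one small C function. This is only a sketch: the function name `assign_codes`, the flat `lens`/`codes` arrays (instead of the RFC's `tree[n].Len` and `tree[n].Code`) and the use of `MAX_BITS = 15` are assumptions made for illustration, and, like the RFC text, it performs no consistency checks on the data.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BITS 15  /* longest code length DEFLATE allows */

/* Assign codes from code lengths, following the three RFC steps.
 * No consistency checks on the data are made, as in the RFC text.
 * Assumes every lens[i] <= MAX_BITS (not verified here). */
static void assign_codes(const unsigned char *lens, unsigned *codes, size_t n)
{
    unsigned bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    unsigned code = 0;

    /* Programming errors are asserted; this is not data validation. */
    assert(n == 0 || (lens != NULL && codes != NULL));

    /* Step 1: count the codes of each length. */
    for (size_t i = 0; i < n; i++)
        bl_count[lens[i]]++;
    bl_count[0] = 0;

    /* Step 2: smallest code value for each length. */
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    /* Step 3: consecutive values within each length.
     * Symbols with length 0 are left untouched. */
    for (size_t i = 0; i < n; i++) {
        if (lens[i] != 0)
            codes[i] = next_code[lens[i]]++;
    }
}
```

With the example lengths from the RFC, (3, 3, 3, 3, 3, 2, 4, 4), this yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111.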

But there are no data consistency checks in the algorithm. On the other hand, it is obvious that the lengths list can be invalid. The individual length values cannot be invalid, because they are encoded in 4 bits, but, for example, there can be more codes of some length than that length can encode.

What is the minimal set of checks that will provide data validation? Or are such checks unnecessary for some reason that I missed?

I think that checking that next_code[len] does not overflow past its respective bits is enough. So after tree[n].Code = next_code[len]; you can do the following check:

if ((tree[n].Code >> len) != 0)
    print(Error)

If tree[n].Code >> len is nonzero, the code no longer fits in len bits, which means that there are more codes of length len than there should be, so the lengths list had an error in it. On the other hand, if every symbol of the tree is assigned a valid (unique) code, then you have created a correct Huffman tree.
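A per-assignment check of this kind could be sketched in C as follows. The function name `assign_codes_checked`, the array layout and the return convention (0 for success, -1 for a bad list) are assumptions for illustration. The test is written as "the code does not fit in `len` bits", which is one reading of what the check is meant to detect. Note that it only catches over-subscribed lists; an incomplete list still passes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BITS 15

/* Hypothetical helper: returns 0 on success, -1 if a length is out of
 * range or if some code needs more bits than its declared length.
 * Assumes n is small (a few hundred symbols, as in DEFLATE). */
static int assign_codes_checked(const uint8_t *lens, uint32_t *codes, size_t n)
{
    uint32_t bl_count[MAX_BITS + 1] = {0};
    uint32_t next_code[MAX_BITS + 1] = {0};
    uint32_t code = 0;

    assert(n == 0 || (lens != NULL && codes != NULL));

    for (size_t i = 0; i < n; i++) {
        if (lens[i] > MAX_BITS)
            return -1;              /* would index past bl_count */
        bl_count[lens[i]]++;
    }
    bl_count[0] = 0;

    for (int bits = 1; bits <= MAX_BITS; bits++) {
        code = (code + bl_count[bits - 1]) << 1;
        next_code[bits] = code;
    }

    for (size_t i = 0; i < n; i++) {
        unsigned len = lens[i];
        if (len == 0)
            continue;               /* unused symbol, no code */
        codes[i] = next_code[len]++;
        if ((codes[i] >> len) != 0)
            return -1;              /* too many codes of this length */
    }
    return 0;
}
```

For example, three codes of length 1 are rejected, because the third one would need the value 2, which does not fit in one bit.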

EDIT: It just dawned on me: you can make the same check at the end of step one. You just have to check that bl_count[N] <= 2^N - SUM((2^j)*bl_count[N-j]), with the sum taken over all 1<=j<=N, for all N >= 1, where bl_count[0] is 0 as in step 2. (If a binary tree has bl_count[N-1] leaves in level N-1, then it cannot have more than 2^N - 2*bl_count[N-1] leaves in level N, level 0 being the root.)
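The inequality can be evaluated directly on the counts, before any code is assigned. The sketch below assumes a hypothetical function `counts_fit` that takes a `bl_count` array of `MAX_BITS + 1` entries and treats `bl_count[0]` as 0. The sum runs over the shorter lengths N-j, each weighted by 2^j, and the comparison is rearranged to avoid subtracting unsigned values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BITS 15

/* Returns 1 if, for every N, bl_count[N] plus the weighted counts of all
 * shorter lengths does not exceed 2^N; returns 0 otherwise.
 * bl_count[0] is ignored, i.e. treated as 0. */
static int counts_fit(const unsigned bl_count[MAX_BITS + 1])
{
    assert(bl_count != NULL);

    for (int n = 1; n <= MAX_BITS; n++) {
        uint64_t taken = 0;

        /* Each code of length n-j blocks 2^j code values of length n. */
        for (int j = 1; j < n; j++)
            taken += (uint64_t)bl_count[n - j] << j;

        if (taken + bl_count[n] > ((uint64_t)1 << n))
            return 0;
    }
    return 1;
}
```

The nested loop mirrors the formula term by term; a running total that is doubled at each length would give the same answer in a single pass.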

This guarantees that the code you create is a prefix code, but it does not guarantee that it is the same as the original creator intended. If, for example, the lengths list is invalid in a way that still lets you create a valid prefix code, you cannot prove that this is the Huffman code, simply because you do not know the frequency of occurrence of each symbol.

zlib checks that the list of code lengths is both complete, i.e. that it uses up all bit patterns, and that it does not overflow the bit patterns. The one allowed exception is when there is a single symbol with length 1, in which case the code is allowed to be incomplete (the bit 0 means that symbol; a 1 bit is undefined).

This helps zlib reject random, corrupted, or improperly coded data with higher probability and earlier in the stream. This is a different sort of robustness than what was suggested in another answer here, where you could alternatively permit incomplete codes and only return an error when an undefined code is encountered in the compressed data.

To calculate completeness, you start with the number of bits in the code, k=1, and the number of possible codes, n=2. There are two possible one-bit codes. You subtract from n the number of length 1 codes, n -= a[k]. Then you increment k to look at two-bit codes, and you double n. Subtract the number of two-bit codes, and continue in the same way up to the longest code length. When you're done, n should be zero. If at any point n goes negative, you can stop right there, as you have an invalid set of code lengths. If n is greater than zero when you're done, then you have an incomplete code.
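The procedure just described might look like this in C. It is a sketch written from the description above, not a copy of zlib's source; the function name `code_completeness`, the three-way return value and the variable name `left` (the `n` of the description) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BITS 15

/* Returns 0 if the set of lengths is complete, -1 if it is
 * over-subscribed (or a length is out of range), and 1 if it is
 * incomplete. Lengths of 0 mean "symbol not used". */
static int code_completeness(const unsigned char *lens, size_t n)
{
    unsigned count[MAX_BITS + 1] = {0};   /* a[k] in the description */
    long left = 1;                        /* unused codes so far */

    assert(n == 0 || lens != NULL);

    for (size_t i = 0; i < n; i++) {
        if (lens[i] > MAX_BITS)
            return -1;
        count[lens[i]]++;
    }

    for (int k = 1; k <= MAX_BITS; k++) {
        left <<= 1;                 /* one more bit doubles the codes */
        left -= (long)count[k];     /* take away the k-bit codes */
        if (left < 0)
            return -1;              /* more codes than bit patterns */
    }
    return left > 0 ? 1 : 0;
}
```

A caller that wants to allow the single-symbol exception could accept a result of 1 when the list contains exactly one code and its length is 1.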

You need to make sure that there is no input that will cause your code to execute illegal or undefined behavior, such as indexing off the end of an array, because such illegal inputs might be used to attack your code.
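As a small illustration of that point, the counting step can refuse any length that would index past the end of the count array. The helper name `count_lengths` and its signature are hypothetical. In a DEFLATE decoder the lengths may already be limited to 0..15 by the way they are decoded, so the range check here is an assumption about a more general caller.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BITS 15

/* Fills bl_count[0..MAX_BITS] from lens[0..n-1].
 * Returns 0 on success, -1 if any length exceeds MAX_BITS, in which
 * case nothing is written outside the array. */
static int count_lengths(const unsigned char *lens, size_t n,
                         unsigned bl_count[MAX_BITS + 1])
{
    assert(bl_count != NULL && (n == 0 || lens != NULL));

    for (int k = 0; k <= MAX_BITS; k++)
        bl_count[k] = 0;

    for (size_t i = 0; i < n; i++) {
        if (lens[i] > MAX_BITS)
            return -1;          /* reject before using it as an index */
        bl_count[lens[i]]++;
    }
    return 0;
}
```

The check is placed before the array access, so a bad value is reported as an error rather than silently corrupting memory.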

In my opinion, you should attempt to handle illegal but not dangerous inputs as gracefully as possible, so as to interoperate with programs written by others, which may interpret the specification differently than you have, or which have made small errors that have only one plausible interpretation. This is the Robustness principle; you can find discussions of it starting at http://en.wikipedia.org/wiki/Robustness_principle .
