
Efficient way of storing Huffman tree

I am writing a Huffman encoding/decoding tool and am looking for an efficient way to store the generated Huffman tree inside the output file.

Currently there are two different versions I am implementing.

  1. This one reads the entire file into memory character by character and builds a frequency table for the whole document. This requires outputting the tree only once, so efficiency is not that big of a concern unless the input file is small.
  2. The other method reads a chunk of data, about 64 kilobytes in size, runs the frequency analysis over that, creates a tree and encodes it. In this case, however, I need to output my frequency tree before every chunk so that the decoder can rebuild its tree and properly decode the encoded file. This is where efficiency does come into play, since I want to save as much space as possible.

In my searches so far I have not found a good way of storing the tree in as little space as possible. I am hoping the StackOverflow community can help me find a good solution!

Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.

Do not store the actual frequencies; they're not needed for decoding. You do, however, need the actual tree.

So for each node, starting at the root:

  1. If leaf-node: output a 1-bit + the N-bit character/byte.
  2. If not leaf-node: output a 0-bit, then encode both child nodes (left first, then right) the same way.

To read, do this:

  1. Read a bit. If it is 1, read the N-bit character/byte and return a new node around it with no children.
  2. If the bit was 0, decode the left and right child-nodes the same way, and return a new node around them with those children, but no value.

A leaf-node is basically any node that doesn't have children.

With this approach, you can calculate the exact size of your output before writing it, to figure out whether the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.

Pseudo-code for calculation:

Tree-size = 10 * NUMBER_OF_CHARACTERS - 1
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))

The tree-size calculation takes both leaf and non-leaf nodes into account: each leaf costs 1 flag bit plus the character bits, each internal node costs 1 flag bit, and there is one fewer internal node than there are characters.

SIZE_OF_ONE_CHARACTER is the number of bits per character. In general, Tree-size = (SIZE_OF_ONE_CHARACTER + 2) * NUMBER_OF_CHARACTERS - 1, which gives the factor 10 above for 8-bit characters. Tree-size and Encoded-size together give the total number of bits that the tree plus the encoded data will occupy with my approach.

PATH(c) is a function/table that yields the bit-path from the root down to that character in the tree.
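As a rough illustration of the size calculation, here is a minimal Python sketch. The function names `tree_size_bits` and `encoded_size_bits` are made up for this example, and the frequency table and paths are the ones from the worked example in this answer (characters assumed to be 8 bits each).

```python
def tree_size_bits(num_symbols, bits_per_symbol=8):
    # Each leaf costs 1 flag bit plus the symbol bits; each of the
    # (num_symbols - 1) internal nodes costs 1 flag bit.
    return num_symbols * (1 + bits_per_symbol) + (num_symbols - 1)


def encoded_size_bits(freq, path):
    # freq: symbol -> number of occurrences
    # path: symbol -> bit string from the root down to that symbol
    return sum(count * len(path[sym]) for sym, count in freq.items())


freq = {"A": 6, "B": 1, "C": 6, "D": 2, "E": 5}
path = {"A": "00", "B": "110", "C": "01", "D": "111", "E": "10"}

tree_bits = tree_size_bits(len(freq))
data_bits = encoded_size_bits(freq, path)
total_bytes = (tree_bits + data_bits + 7) // 8  # round up to whole bytes
print(tree_bits, data_bits, total_bytes)
```

Comparing `total_bytes` with the size of the raw input tells you up front whether a chunk is worth compressing at all.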

Here's some C#-looking pseudo-code to do it, which assumes one character is just a simple byte.

void EncodeNode(Node node, BitWriter writer)
{
    if (node.IsLeafNode)
    {
        writer.WriteBit(1);
        writer.WriteByte(node.Value);
    }
    else
    {
        writer.WriteBit(0);
        EncodeNode(node.LeftChild, writer);
        EncodeNode(node.RightChild, writer);
    }
}

To read it back in:

Node ReadNode(BitReader reader)
{
    if (reader.ReadBit() == 1)
    {
        return new Node(reader.ReadByte(), null, null);
    }
    else
    {
        Node leftChild = ReadNode(reader);
        Node rightChild = ReadNode(reader);
        return new Node(0, leftChild, rightChild);
    }
}

An example Node implementation (simplified; a real one would use properties, etc.):

public class Node
{
    public Byte Value;
    public Node LeftChild;
    public Node RightChild;

    public Node(Byte value, Node leftChild, Node rightChild)
    {
        Value = value;
        LeftChild = leftChild;
        RightChild = rightChild;
    }

    public Boolean IsLeafNode
    {
        get
        {
            return LeftChild == null;
        }
    }
}

Here's a sample output from a specific example.

Input: AAAAAABCCCCCCDDEEEEE

Frequencies:

  • A: 6
  • B: 1
  • C: 6
  • D: 2
  • E: 5

Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.

The tree could look like this:

      20
  ----------
  |        8
  |     -------
 12     |     3
-----   |   -----
A   C   E   B   D
6   6   5   1   2

So the paths to each character are as follows (0 is left, 1 is right):

  • A: 00
  • B: 110
  • C: 01
  • D: 111
  • E: 10

So to calculate the output size:

  • A: 6 occurrences * 2 bits = 12 bits
  • B: 1 occurrence * 3 bits = 3 bits
  • C: 6 occurrences * 2 bits = 12 bits
  • D: 2 occurrences * 3 bits = 6 bits
  • E: 5 occurrences * 2 bits = 10 bits

The sum of the encoded bits is 12+3+12+6+10 = 43 bits.

Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 bits (20 bytes) necessary to store the original 20 characters unencoded, and you'll save 8 bytes.

The final output, including the tree to begin with, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas 0 and 1 are just single bits. The space in the stream only separates the tree from the encoded data and does not take up any space in the final output.

001A1C01E01B1D 0000000000001100101010101011111111010101010
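The stream above can be reproduced with a small Python sketch. This is only an illustration: the tree is represented as nested tuples (an assumption made for brevity, not the Node class shown earlier), and bits are collected as text characters rather than written through a real bit writer.

```python
def encode_node(node, out):
    # A leaf is a 1-character string; an internal node is a (left, right) tuple.
    if isinstance(node, str):
        out.append("1" + format(ord(node), "08b"))  # 1-bit flag + 8-bit value
    else:
        out.append("0")                              # 0-bit flag, then children
        encode_node(node[0], out)
        encode_node(node[1], out)


def build_paths(node, prefix="", table=None):
    # Walk the tree and record the bit-path (0 = left, 1 = right) of each leaf.
    table = {} if table is None else table
    if isinstance(node, str):
        table[node] = prefix
    else:
        build_paths(node[0], prefix + "0", table)
        build_paths(node[1], prefix + "1", table)
    return table


# The example tree: (A, C) on the left, (E, (B, D)) on the right.
tree = (("A", "C"), ("E", ("B", "D")))

parts = []
encode_node(tree, parts)
tree_bits = "".join(parts)

paths = build_paths(tree)
data_bits = "".join(paths[ch] for ch in "AAAAAABCCCCCCDDEEEEE")
print(len(tree_bits), len(data_bits))
```

Here `tree_bits` is the 49-bit form of `001A1C01E01B1D` with each letter expanded to its 8-bit ASCII value, and `data_bits` is the 43-bit payload.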

For the concrete example you have in the comments, AABCDEF, you will get this:

Input: AABCDEF

Frequencies:

  • A: 2
  • B: 1
  • C: 1
  • D: 1
  • E: 1
  • F: 1

Tree:

        7
  -------------
  |           4
  |       ---------
  3       2       2
-----   -----   -----
A   B   C   D   E   F
2   1   1   1   1   1

Paths:

  • A: 00
  • B: 01
  • C: 100
  • D: 101
  • E: 110
  • F: 111

Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes

Since the original was 7 characters of 8 bits = 56 bits, the overhead is too high for such small pieces of data.
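To show the reading side on this same example, here is a hedged Python sketch that rebuilds the tree from the bit stream and then decodes the data. `BitReader` is a toy helper over a string of '0'/'1' characters, and the decoder is told the number of symbols (7) explicitly; how that count is stored in a real file is an assumption left open here.

```python
class BitReader:
    # Toy reader over a string of '0'/'1' characters (not a real byte stream).
    def __init__(self, bits):
        self.bits = bits
        self.pos = 0

    def read(self, count):
        chunk = self.bits[self.pos:self.pos + count]
        self.pos += count
        return chunk


def read_node(reader):
    if reader.read(1) == "1":
        return chr(int(reader.read(8), 2))   # leaf: 8-bit value follows
    left = read_node(reader)                 # internal node: left, then right
    right = read_node(reader)
    return (left, right)


def decode(reader, tree, symbol_count):
    out = []
    for _ in range(symbol_count):
        node = tree
        while not isinstance(node, str):     # follow bits until a leaf is hit
            node = node[int(reader.read(1))]
        out.append(node)
    return "".join(out)


def leaf(ch):
    return "1" + format(ord(ch), "08b")


# Tree 001A1B001C1D01E1F followed by the 18 data bits.
stream = ("00" + leaf("A") + leaf("B") + "00" + leaf("C") + leaf("D")
          + "0" + leaf("E") + leaf("F") + "000001100101110111")

reader = BitReader(stream)
tree = read_node(reader)
text = decode(reader, tree, 7)
print(text)
```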

If you have enough control over the tree generation, you could make it produce a canonical tree (the same way DEFLATE does, for example), which basically means you create rules to resolve any ambiguous situations when building the tree. Then, like DEFLATE, all you actually have to store are the lengths of the codes for each character.

That is, if you had the tree/codes Lasse mentioned above:

  • A: 00
  • B: 110
  • C: 01
  • D: 111
  • E: 10

Then you could store those as: 2, 3, 2, 3, 2

And that's actually enough information to regenerate the Huffman table, assuming you're always using the same character set, say ASCII. (Which means you couldn't skip letters: you'd have to list a code length for each one, even if it's zero.)

If you also put a limit on the code lengths (say, 7 bits), you could store each of these numbers as a short 3-bit binary string. So 2,3,2,3,2 becomes 010 011 010 011 010, which fits in 2 bytes.
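As a sketch of how the codes can be regenerated from lengths alone, the following Python assigns canonical codes using the ordering rule RFC 1951 describes (shorter codes first, ties broken by symbol order). The function name `canonical_codes` is invented for this example.

```python
def canonical_codes(lengths):
    # lengths: symbol -> code length in bits (0 means the symbol is unused).
    codes = {}
    code = 0
    prev_len = 0
    # Shorter codes come first; symbols with equal length go in symbol order.
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if length == 0:
            continue
        code <<= (length - prev_len)          # widen the code when length grows
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


lengths = {"A": 2, "B": 3, "C": 2, "D": 3, "E": 2}
codes = canonical_codes(lengths)

# Packing each length into 3 bits, as described above.
packed = "".join(format(lengths[sym], "03b") for sym in "ABCDE")
print(codes, packed)
```

For this particular example the canonical codes happen to coincide with the codes listed above; in general they can differ from the codes of the original tree while keeping the same lengths, which is why encoder and decoder must both use the canonical assignment.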

If you want to get really crazy, you could do what DEFLATE does: make another Huffman table for the lengths of these codes, and store its code lengths beforehand. This is especially worthwhile since DEFLATE adds extra codes for "insert zero N times in a row" to shorten things further.

The RFC for DEFLATE isn't too bad if you're already familiar with Huffman coding: http://www.ietf.org/rfc/rfc1951.txt

Branches are 0, leaves are 1. Traverse the tree depth-first to get its "shape",

e.g. the shape for this tree

0 - 0 - 1 (A)
|    \- 1 (E)
  \
    0 - 1 (C)
     \- 0 - 1 (B)
         \- 1 (D)

would be 001101011

Follow that with the bits for the characters in the same depth-first order, AECBD (when reading, you'll know how many characters to expect from the shape of the tree). Then output the codes for the message. You then have a long series of bits that you can divide up into characters for output.
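A minimal Python sketch of this shape-first layout, using nested tuples for the tree (an assumption for brevity). `shape_length` is a hypothetical helper that shows how a reader can tell where the shape ends and how many characters follow.

```python
def shape_and_symbols(node, shape, symbols):
    # Depth-first walk: 0 for a branch, 1 for a leaf; leaves are collected
    # separately in the same order.
    if isinstance(node, str):
        shape.append("1")
        symbols.append(node)
    else:
        shape.append("0")
        shape_and_symbols(node[0], shape, symbols)
        shape_and_symbols(node[1], shape, symbols)


def shape_length(bits):
    # The shape is complete once every pending subtree has been filled:
    # a branch opens one extra slot, a leaf closes one.
    pending = 1
    i = 0
    while pending:
        pending += 1 if bits[i] == "0" else -1
        i += 1
    return i


tree = (("A", "E"), ("C", ("B", "D")))
shape, symbols = [], []
shape_and_symbols(tree, shape, symbols)
shape_bits = "".join(shape)
leaf_order = "".join(symbols)

# A reader sees the shape followed by other bits and can find the boundary.
n = shape_length(shape_bits + "0100000101000101")
leaf_count = shape_bits[:n].count("1")
print(shape_bits, leaf_order, n, leaf_count)
```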

If you are chunking it, you could test whether storing the tree for the next chunk is as efficient as just reusing the tree from the previous chunk, and use a tree shape of "1" as an indicator to reuse the tree from the previous chunk.

The tree is generally created from a frequency table of the bytes. So store that table, or just the bytes themselves sorted by frequency, and re-create the tree on the fly. This of course assumes that you're building the tree to represent single bytes, not larger blocks.

UPDATE: As pointed out by j_random_hacker in a comment, you actually can't do this with the sorted bytes alone: you need the frequency values themselves. They are combined and "bubble" upwards as you build the tree. This page describes the way a tree is built from the frequency table. As a bonus, it also saves this answer from being deleted by mentioning a way to save out the tree:

The easiest way to output the huffman tree itself is to, starting at the root, dump first the left hand side then the right hand side. For each node you output a 0, for each leaf you output a 1 followed by N bits representing the value.
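To make the point about needing the frequency values concrete, here is a rough Python sketch of the usual construction with a priority queue (`heapq`). It computes only the code length of each symbol; the tie-breaking counter is an implementation detail chosen for this sketch so that heap entries stay comparable.

```python
import heapq
from itertools import count


def code_lengths(freq):
    # freq: symbol -> number of occurrences.
    tie = count()  # tie-breaker so equal weights never compare the lists
    heap = [(weight, next(tie), [sym]) for sym, weight in sorted(freq.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freq}
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)   # two lightest subtrees
        w2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:            # every merged leaf gets 1 level deeper
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, next(tie), syms1 + syms2))
    return lengths


skewed = code_lengths({"A": 1, "B": 2, "C": 4, "D": 8})
flat = code_lengths({"A": 10, "B": 11, "C": 12, "D": 13})
print(skewed, flat)
```

Both tables list the symbols in the same frequency order (A, B, C, D), yet they produce different code lengths, so the order alone is not enough to rebuild the tree.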

A better approach

Tree:

           7
     -------------
     |           4
     |       ---------
     3       2       2
   -----   -----   -----
   A   B   C   D   E   F
   2   1   1   1   1   1 : frequencies
   2   2   3   3   3   3 : tree depth (encoding bits)

Now just derive this table:

   depth number of codes
   ----- ---------------
     2   2 [A B]
     3   4 [C D E F]

You don't need to use the same binary tree; just keep the computed tree depth, i.e. the number of encoding bits, for each value. So keep the vector of uncompressed values [ABCDEF] ordered by tree depth, and use indexes relative to this separate vector instead. Now recreate the aligned bit patterns for each depth:

   depth number of codes
   ----- ---------------
     2   2 [00x 01x]
     3   4 [100 101 110 111]

What you immediately see is that only the first bit pattern in each row is significant. You get the following lookup table:

    first pattern depth first index
    ------------- ----- -----------
    000           2     0
    100           3     2

This LUT is very small (even if your Huffman codes can be 32 bits long, it will contain at most 32 rows). In fact the first pattern is always zero, so you can ignore it completely when performing a binary search of patterns in the table (here only 1 pattern needs to be compared to know whether the bit depth is 2 or 3 and to get the first index at which the associated data is stored in the vector). In the general case you perform a fast binary search of input patterns in a search space of at most 31 values, i.e. a maximum of 5 integer compares. These 31 compare routines can be unrolled into 31 code branches to avoid all loops and any state management while browsing the integer binary lookup tree. The whole table fits in a small fixed size (the LUT needs at most 31 rows for Huffman codes no longer than 32 bits, and the 2 other columns above fill at most 32 rows).

In other words, the LUT above requires 31 ints of 32 bits each, plus 32 bytes to store the bit depth values. You can avoid storing the depths by implying the depth column (and the first row for depth 1):

    first pattern (depth) first index
    ------------- ------- -----------
    (000)          (1)    (0)
     000           (2)     0
     100           (3)     2
     000           (4)     6
     000           (5)     6
     ...           ...     ...
     000           (32)    6

So your LUT contains [000, 100, 000 (30 times)]. To search in it, you must find the position where the input bit pattern lies between two patterns: it must be lower than the pattern at the next position in this LUT but still higher than or equal to the pattern at the current position (if both positions contain the same pattern, the current row does not match and the input pattern fits further down). You then divide and conquer, using 5 tests at most. The binary search requires a single piece of code with 5 nested if/then/else levels and 32 branches; the branch reached directly indicates the bit depth, which therefore does not need to be stored. You then perform a single directly indexed lookup into the second table to get the first index, and derive the final index into the vector of decoded values by addition.

Once you get a position in the lookup table (search in the 1st column), you immediately have the number of bits to take from the input and the start index into the vector. The bit depth can then be used to derive the adjusted index position directly, by basic bitmasking after subtracting the first index.

In summary: never store linked binary trees. You don't need any loop to perform the lookup, which just requires 5 nested ifs comparing patterns at fixed positions in a table of 31 patterns, plus a table of 31 ints containing the start offsets within the vector of decoded values (in the first branch of the nested if/then/else tests, the start offset into the vector is implied and always zero; it is also the most frequently taken branch, as it matches the shortest code, which belongs to the most frequent decoded values).
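The table-driven lookup can be sketched in Python as follows. This is a simplified illustration, not the unrolled binary search described above: it scans the rows linearly, and instead of the first pattern of each row it stores the first pattern of the next row (called `limit` here) so that a single comparison per row suffices. All names are hypothetical, and the number of symbols to decode is passed in explicitly.

```python
def build_lut(lengths_in_order):
    # lengths_in_order: code length of each symbol, already sorted by length.
    max_len = max(lengths_in_order)
    rows = []   # (limit, length, first_code, first_index)
    code = 0    # first code of the current length
    index = 0   # index of the first symbol with the current length
    for length in range(1, max_len + 1):
        n = lengths_in_order.count(length)
        if n:
            # Left-align the end of this length's code range to max_len bits.
            limit = (code + n) << (max_len - length)
            rows.append((limit, length, code, index))
        code = (code + n) << 1
        index += n
    return max_len, rows


def decode(bits, symbols, lengths_in_order, symbol_count):
    max_len, rows = build_lut(lengths_in_order)
    out, pos = [], 0
    for _ in range(symbol_count):
        # Peek max_len bits, padding with zeros at the end of the stream.
        window = int(bits[pos:pos + max_len].ljust(max_len, "0"), 2)
        for limit, length, first_code, first_index in rows:
            if window < limit:
                code = window >> (max_len - length)
                out.append(symbols[first_index + code - first_code])
                pos += length
                break
    return "".join(out)


text = decode("000001100101110111", "ABCDEF", [2, 2, 3, 3, 3, 3], 7)
print(text)
```

For the example above, `build_lut` yields two rows: depth 2 starting at code 00 with first index 0, and depth 3 starting at code 100 with first index 2, matching the lookup table in the answer.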

As the other answers state, there are two main ways to store Huffman code LUTs. You can either store the geometry of the tree (0 for a node, 1 for a leaf) together with all the leaf values, or you can use canonical Huffman encoding and store the lengths of the Huffman codes.

The thing is, one method is better than the other depending on the circumstances. Let's say the number of unique symbols in the data you wish to compress is n (in aabbbcdddd there are 4 unique symbols: a, b, c, d).

The number of bits needed to store the geometry of the tree alongside the symbols in the tree is 10n - 1.

Assuming you store the code lengths in order of the symbols they belong to, and that each code length is stored in 8 bits (a Huffman code for a 256-symbol alphabet is at most 255 bits long, so its length always fits in 8 bits), the size of the code length table will be a flat 2048 bits.

When you have a high number of unique symbols, say 256, it will take 2559 bits to store the geometry of the tree. In this case, the code length table is more efficient: 511 bits more efficient, to be exact.

But if you only have 5 unique symbols, the tree geometry only takes 49 bits, and in this case storing the tree geometry is almost 2000 bits better than storing the code length table.

The tree geometry is more efficient for n < 205, while a code length table is more efficient for n >= 205. So, why not get the best of both worlds, and use both? Have 1 bit at the start of your compressed data indicate whether the bits that follow are in the format of a code length table or the geometry of the Huffman tree.
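The comparison can be written down as a small Python sketch. The function names are illustrative only, and the defaults assume 8-bit symbols, a 256-symbol alphabet and 8 bits per stored code length, as in the figures above.

```python
def tree_geometry_bits(n, bits_per_symbol=8):
    # n leaves at (1 + bits_per_symbol) bits each, plus (n - 1) branch bits.
    return (bits_per_symbol + 2) * n - 1


def length_table_bits(alphabet_size=256, bits_per_length=8):
    # One fixed-width length per symbol of the alphabet, used or not.
    return alphabet_size * bits_per_length


def choose_format(n):
    # Pick whichever header is smaller for n unique symbols.
    if tree_geometry_bits(n) < length_table_bits():
        return "tree"
    return "lengths"


for n in (5, 204, 205, 256):
    print(n, tree_geometry_bits(n), choose_format(n))
```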

In fact, why not add two bits, and when both of them are 0, there is no table and the data is uncompressed? Because sometimes you can't get compression! It would be best to have a single byte at the beginning of your file that is 0x00, telling your decoder not to worry about doing anything. This saves space by not including the table or tree geometry, and saves time by not compressing and decompressing data unnecessarily.
