How to save a Huffman table in a file in a way that uses the least storage?

It's my first question on Stack Overflow. It's long, but I have explained it in detail and I think it's understandable.

I'm writing a Huffman coder in C++ and have saved the characters and their codes in a table like this:

Text: AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE

Table (made from the Huffman tree): (table image not preserved)

Now I want to save this table to a file in the best way.

I can't save it like this: A1B001C010D001E000

When it is converted to bits: 01000001101000010001010000110100100010000101000101000

Because I can't decode this.

If I save the table in the normal way, every character uses 8 bits to store its code.

But my characters have 1-bit or 3-bit codes (in this case).

This way uses a lot of storage.

My idea is to add a separator character and assign a code to it.

If we add a separator character, build the Huffman tree, and write out the codes, we get a table like this: (Table 2, image not preserved)

Now we can write the codes this way:

A0SepB110SepC100SepD1111SepE1110Sep

binary = 0100000101010100001011010101000011100101010001001111101010001011110101

I decode it this way:

sep = 101.

  • Read 8 bits: 01000001 -> it's A.

rest = 01010100001011010101000011100101010001001111101010001011110101

  • Read 1 bit: 0 (unlike sep1)
  • Read 1 bit: 1 (like sep1), read 1 bit: 0 (like sep2), read 1 bit: 1 (like sep3, the end)
  • Sep was found, so A = everything before sep = 0.

rest = 0100001011010101000011100101010001001111101010001011110101

  • Read 8 bits: 01000010 -> it's B.

rest = 11010101000011100101010001001111101010001011110101

  • Read 1 bit: 1 (like sep1), read 1 bit: 1 (unlike sep2)
  • Read 1 bit: 0 (unlike sep1)
  • Read 1 bit: 1 (like sep1), read 1 bit: 0 (like sep2), read 1 bit: 1 (like sep3, the end)
  • Sep was found, so B = everything before sep = 110.

And so on...
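The steps above can be sketched roughly as follows. This is a minimal sketch and not the exact code from the question: `decodeSeparatedTable` is a hypothetical name, and the bit stream is held as a string of '0'/'1' characters for readability instead of packed bits.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper mirroring the manual steps: read an 8-bit character,
// then collect code bits until the separator pattern has been matched.
std::vector<std::pair<char, std::string>>
decodeSeparatedTable(const std::string& bits, const std::string& sep) {
    std::vector<std::pair<char, std::string>> table;
    std::size_t pos = 0;
    while (pos + 8 <= bits.size()) {
        // 8 bits -> character
        char symbol = static_cast<char>(std::stoi(bits.substr(pos, 8), nullptr, 2));
        pos += 8;

        std::string code;
        std::size_t matched = 0;  // how many separator bits matched so far
        bool found = false;
        while (pos < bits.size()) {
            char b = bits[pos++];
            if (b == sep[matched]) {
                if (++matched == sep.size()) { found = true; break; }
            } else {
                // The partially matched bits were code bits after all.
                code.append(sep, 0, matched);
                code += b;
                matched = 0;
            }
        }
        if (!found) break;  // ran out of bits before a separator
        table.emplace_back(symbol, code);
    }
    return table;
}
```

Run on the binary string above with sep = 101, this sketch yields A = 0, B = 110, C = 100 and D = 1111, but E = 11 instead of 1110: the end of E's code followed by the separator (1110101) already contains 101 two bits early. So a separator scan like this is not reliable in general, apart from the extra storage it costs.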

This way still uses some storage for the separators (number of characters * separator size).

My question: Is there a way to save the first table in a file that uses less storage?

For example, like this: A1B001C010D001E000.

Don't save the table with the codes. Just save the lengths. See Canonical Huffman Code.
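As a rough illustration of why the lengths are enough, here is a minimal sketch that rebuilds a canonical code from lengths alone. It assumes C++17 and code lengths below 64 bits; `canonicalCodes` is a hypothetical helper name, and the numbering rule (sort by length, then by symbol value, assign consecutive codes) is the usual canonical Huffman convention.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: symbol -> code length in, symbol -> bit string out.
// Assumption: every length is in 1..63; a length of 0 means "symbol unused".
std::map<unsigned char, std::string>
canonicalCodes(const std::map<unsigned char, int>& lengths) {
    std::vector<std::pair<int, unsigned char>> order;  // (length, symbol)
    for (const auto& [sym, len] : lengths)
        if (len > 0) order.emplace_back(len, sym);
    std::sort(order.begin(), order.end());  // by length, then symbol value

    std::map<unsigned char, std::string> codes;
    std::uint64_t code = 0;
    int prevLen = 0;
    for (const auto& [len, sym] : order) {
        code <<= (len - prevLen);  // pad with zeros when the length grows
        prevLen = len;
        std::string bits(len, '0');
        for (int i = 0; i < len; ++i)
            if ((code >> (len - 1 - i)) & 1) bits[i] = '1';
        codes[sym] = bits;
        ++code;  // next symbol gets the next consecutive code
    }
    return codes;
}
```

For the lengths in the question (A: 1 bit, B to E: 3 bits each) this gives A = 0, B = 100, C = 101, D = 110, E = 111. The codes differ from those in the original table but have the same lengths, so the compressed size is unchanged as long as the encoder uses the same canonical assignment.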

You can store the lengths of the codes (as Mark said) as a 256-byte header at the start of your compressed data. Each byte stores the length of one code. Because you're working with bytes, which have 256 possible values, and the Huffman tree can be at most 255 levels deep (number of possible values - 1), 8 bits are enough to store each code length.

The first byte stores the code length of the value 0x00, the second byte stores the code length of 0x01, and so on.

However, if you are compressing English text, there is a better way to store the table. Store the shape of the tree, visited in pre-order, with 0s for internal nodes and 1s for leaves. Then, after the shape, store the values of the leaves in the same order.

The tree for AAAAAAAAAAAAAAAAAAAABBBBBBBCCCCDDDDEEEE looks like this:

        *
      /   \
     *     A
   /   \
  *     *
 / \   / \
E   D C   B

So you would store the shape of the tree like this: 000110111EDCBA
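A minimal sketch of writing and reading that format is shown below. The `Node` type and the helper names are hypothetical, the shape is kept as a string of '0'/'1' characters for readability (a real encoder would pack the bits), and every internal node is assumed to have exactly two children, as in a Huffman tree.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

struct Node {
    char symbol = 0;  // meaningful only for leaves
    std::unique_ptr<Node> left, right;
    bool isLeaf() const { return !left && !right; }
};

std::unique_ptr<Node> leaf(char c) {
    auto n = std::make_unique<Node>();
    n->symbol = c;
    return n;
}

std::unique_ptr<Node> join(std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->left = std::move(l);
    n->right = std::move(r);
    return n;
}

// Pre-order walk: '0' for an internal node, '1' for a leaf;
// leaf values are collected separately in the same order.
void serialize(const Node& n, std::string& shape, std::string& symbols) {
    if (n.isLeaf()) { shape += '1'; symbols += n.symbol; return; }
    shape += '0';
    serialize(*n.left, shape, symbols);
    serialize(*n.right, shape, symbols);
}

// Rebuilds the tree; throws std::out_of_range on truncated input.
std::unique_ptr<Node> deserialize(const std::string& shape, const std::string& symbols,
                                  std::size_t& shapePos, std::size_t& symPos) {
    auto n = std::make_unique<Node>();
    if (shape.at(shapePos++) == '1') {
        n->symbol = symbols.at(symPos++);
        return n;
    }
    n->left = deserialize(shape, symbols, shapePos, symPos);
    n->right = deserialize(shape, symbols, shapePos, symPos);
    return n;
}
```

Building the tree drawn above as `join(join(join(leaf('E'), leaf('D')), join(leaf('C'), leaf('B'))), leaf('A'))` and serializing it gives the shape 000110111 and the symbols EDCBA. The decoder does not need a node count: it stops as soon as the recursion for the root returns.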

Storing the Huffman codes this way is better when you are compressing English text because the tree format costs 10n - 1 bits, where n is the number of unique characters in the data: 2n - 1 bits for the shape (n leaves plus n - 1 internal nodes) and 8n bits for the leaf values. Storing the code lengths costs a flat 2048 bits. Therefore, for fewer than 205 unique characters, storing the shape of the tree is more efficient, and because the average English text doesn't use anywhere near all 256 possible byte values, you're usually better off storing the tree shape.
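The arithmetic behind that threshold can be checked with a few lines of C++. The function names here are hypothetical, and n is assumed to be at least 1.

```cpp
#include <cstddef>

// Tree format: (2n - 1) shape bits plus one byte per leaf value. Requires n >= 1.
constexpr std::size_t treeFormatBits(std::size_t n) {
    return (2 * n - 1) + 8 * n;  // = 10n - 1
}

// Length format: one byte for each of the 256 possible byte values.
constexpr std::size_t lengthFormatBits() {
    return 256 * 8;  // = 2048
}

constexpr bool treeIsSmaller(std::size_t n) {
    return treeFormatBits(n) < lengthFormatBits();
}

static_assert(treeFormatBits(5) == 49, "the five-symbol example costs 49 bits");
static_assert(treeIsSmaller(204), "2039 bits < 2048 bits");
static_assert(!treeIsSmaller(205), "2049 bits > 2048 bits");
```

For the five-symbol example the tree format needs 49 bits (9 for the shape, 40 for the values) against 2048 bits for the length header.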

If you aren't just compressing text, and you're compressing more general data where the number of unique characters is likely to be 205 or more, you should probably use the code-length format. Alternatively, include 1 bit at the start of your header that says whether a tree or a list of code lengths follows, and write your decoder to handle either one depending on that bit.
