
How to write Huffman code to a binary file?

Original post: 2023-01-13 17:11:54. Tags: java, compression, huffman-code

I have a sample .txt file that I want to compress using Huffman encoding. My problem is this: if each character takes one byte, and the smallest unit I can write is a byte, how do I reduce the size of the sample file?

I converted the sample file into Huffman codes and wrote them to a new, empty .txt file, which consists only of 0s and 1s as one huge line of characters. Then I took the new file and used the BitSet class in Java to write to a binary file bit by bit: if the character in the new file was 0 or 1, I wrote a 0 or 1 bit, respectively, to the binary file. This process was very slow and crashed my computer multiple times. I was hoping that someone has a more efficient solution. All of my code is written in Java.

2 answers

One way is to use BitSet to set the bits that represent each code as you compute it. Then you can call either BitSet.toByteArray() or BitSet.toLongArray() and write out the result. Both return the bits in little-endian order.
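A minimal sketch of the BitSet approach, assuming the codes are already available as a string of '0' and '1' characters. The class name BitSetPacker and the method pack are made up for illustration. One thing to watch: toByteArray() only returns bytes up to the highest set bit, so trailing zero bits are dropped. The sketch pads the array back to the full length, and a real file format would also need to record the number of bits.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical helper: packs a string of '0'/'1' characters into bytes.
class BitSetPacker {
    static byte[] pack(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i); // bit i ends up in byte i / 8, at position i % 8
            }
        }
        // toByteArray() stops at the highest set bit, so pad with zero
        // bytes up to the length implied by the number of input bits.
        int byteCount = (bits.length() + 7) / 8;
        return Arrays.copyOf(set.toByteArray(), byteCount);
    }
}
```

The resulting array can be written in a single call, for example with Files.write(path, BitSetPacker.pack(bits)), which avoids writing to the file one bit at a time.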

Do not write "0" and "1" characters to the file. Write 0 and 1 bits to the file.

You do this by accumulating eight bits into a byte buffer using the shift (<<) and or (|) operators, and then writing that byte to the file. Repeat. At the end you may have fewer than eight bits in the byte buffer. If so, write that byte to the file; its remaining bits will be filled with zeros.

E.g. int buf = 0, count = 0; then, for each bit: buf |= bit << count++; then check for eight: if (count == 8) { out.writeByte(buf); buf = count = 0; }. At the end: if (count > 0) out.writeByte(buf);
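Put together, the fragments above might look like the following sketch. The class BitWriter and its methods are illustrative names, not an existing library API. It writes through a plain OutputStream rather than the DataOutputStream implied by writeByte, and it takes each code as a string of '0' and '1' characters, which is an assumption about how the code table is stored.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical bit-level writer: fills each byte starting from its lowest bit.
class BitWriter implements AutoCloseable {
    private final OutputStream out;
    private int buf = 0;   // bits accumulated so far
    private int count = 0; // number of bits currently in buf

    BitWriter(OutputStream out) {
        this.out = out;
    }

    void writeBit(int bit) throws IOException {
        buf |= (bit & 1) << count++;
        if (count == 8) {
            flushByte();
        }
    }

    // Writes one Huffman code given as a string such as "1101".
    void writeCode(String code) throws IOException {
        for (int i = 0; i < code.length(); i++) {
            writeBit(code.charAt(i) - '0');
        }
    }

    private void flushByte() throws IOException {
        out.write(buf);
        buf = 0;
        count = 0;
    }

    // Writes the final partial byte (unused bits stay zero), then closes.
    @Override
    public void close() throws IOException {
        if (count > 0) {
            flushByte();
        }
        out.close();
    }

    // Convenience for trying it out: packs a bit string into a byte array.
    static byte[] pack(String bits) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (BitWriter writer = new BitWriter(bytes)) {
            writer.writeCode(bits);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

For a real file, wrapping the target in new BufferedOutputStream(new FileOutputStream(name)) before handing it to BitWriter should keep the byte-at-a-time writes from reaching the disk individually, which is likely part of what made the original attempt slow.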

When decoding the Huffman codes, you may run into a problem with those filler zero bits in the last byte. They could be decoded as one or more extraneous symbols. To deal with this, the decoder needs to know when to stop, either by sending the number of symbols before the Huffman codes or by adding an end-of-stream symbol to the code.
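To illustrate the first option, here is a rough sketch of a decoder that stops after a known number of symbols. It assumes the bits were packed lowest-bit-first as in the snippet above, and it looks codes up in a map from code string to character instead of walking a Huffman tree, purely to keep the example short. CountedDecoder, decode, and the tiny code table in the usage note are all hypothetical.

```java
import java.util.Map;

// Hypothetical decoder: reads bits lowest-first and stops after symbolCount symbols.
class CountedDecoder {
    static String decode(byte[] data, int symbolCount, Map<String, Character> codes) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder(); // bits of the code being read
        int totalBits = data.length * 8;
        int bitIndex = 0;
        while (out.length() < symbolCount && bitIndex < totalBits) {
            // Extract bit (bitIndex % 8) of byte (bitIndex / 8).
            int bit = (data[bitIndex >> 3] >> (bitIndex & 7)) & 1;
            bitIndex++;
            current.append(bit == 1 ? '1' : '0');
            // A prefix code guarantees at most one code matches at any point.
            Character symbol = codes.get(current.toString());
            if (symbol != null) {
                out.append(symbol);
                current.setLength(0);
            }
        }
        return out.toString();
    }
}
```

With the made-up table 0 -> a, 10 -> b, 11 -> c, the text "abc" encodes to the bits 0 10 11, which pack into the single byte 26. Decoding that byte with a symbol count of 3 gives "abc". Asking for more symbols than were written lets the three filler zero bits decode as extra a's, which is exactly the problem described above.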