How to write Huffman code to a binary file?

I have a sample .txt file that I want to compress using Huffman encoding. My problem is that if one character has a size of one byte and the smallest size you can write is a byte, how do I reduce the size of the sample file?

I converted the sample file into Huffman codes and wrote them to a new, empty .txt file, which consists only of the characters 0 and 1 on one huge line. Then I took the new file and used the BitSet class in Java to write to a binary file bit by bit: if the character in the new file was 0 or 1, I wrote a 0 or 1 bit respectively to the binary file. This process was very slow and it crashed my computer multiple times. I was hoping that someone had a more efficient solution. I have written all my code in Java.

One way is to use BitSet to set the bits that represent the code as you compute it. Then you can do either BitSet.toByteArray() or BitSet.toLongArray() and write out the information. Both of these store the bits in little endian encoding.
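A minimal sketch of the BitSet approach, assuming the codes are already available as a string of '0' and '1' characters (the class name BitSetPacking and the method pack are made up for illustration). One thing to watch for: BitSet.toByteArray() only covers bits up to the highest set bit, so trailing zero bits are dropped and the array may be shorter than expected. The sketch pads the result back to the full length.

```java
import java.util.Arrays;
import java.util.BitSet;

final class BitSetPacking {
    /** Packs a string of '0'/'1' characters into bytes; bit i of the input lands at BitSet index i. */
    static byte[] pack(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        // toByteArray() stops at the highest set bit, so trailing zero bits are not represented.
        byte[] packed = set.toByteArray();
        // Pad with zero bytes so the output always holds ceil(bits / 8) bytes.
        int needed = (bits.length() + 7) / 8;
        return Arrays.copyOf(packed, needed);
    }
}
```

Because the padding hides how many bits were really written, the bit count still has to be stored somewhere for the decoder.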

Do not write "0" and "1" characters to the file. Write 0 and 1 bits to the file.

You do this by accumulating eight bits into a byte buffer using the shift ( << ) and or ( | ) operators, and then writing that byte to the file. Repeat. At the end you may have fewer than eight bits in the byte buffer. If so, write that byte to the file anyway; its remaining bits will be zeros.

E.g. int buf = 0, count = 0; then for each bit: buf |= bit << count++; and check for eight: if (count == 8) { out.writeByte(buf); buf = count = 0; } At the end, if (count > 0) out.writeByte(buf);
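The fragments above can be collected into a small helper class. This is only a sketch: the names BitWriter, writeBit, writeCode and pack are hypothetical, and it writes to a plain OutputStream with write(int) rather than a DataOutputStream with writeByte, which produces the same byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class BitWriter implements AutoCloseable {
    private final OutputStream out;
    private int buf = 0;    // bits collected so far, first bit in the lowest position
    private int count = 0;  // number of bits currently held in buf

    BitWriter(OutputStream out) {
        this.out = out;
    }

    /** Appends one bit (0 or 1) and emits a byte once eight bits are collected. */
    void writeBit(int bit) throws IOException {
        buf |= (bit & 1) << count++;
        if (count == 8) {
            out.write(buf);
            buf = 0;
            count = 0;
        }
    }

    /** Appends a whole code given as a string of '0' and '1' characters. */
    void writeCode(String code) throws IOException {
        for (int i = 0; i < code.length(); i++) {
            writeBit(code.charAt(i) == '1' ? 1 : 0);
        }
    }

    /** Writes the last partial byte, if any (unused bits stay zero), and closes the stream. */
    @Override
    public void close() throws IOException {
        if (count > 0) {
            out.write(buf);
            buf = 0;
            count = 0;
        }
        out.close();
    }

    /** Convenience for experiments: packs a '0'/'1' string into an in-memory byte array. */
    static byte[] pack(String bits) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (BitWriter writer = new BitWriter(bytes)) {
            writer.writeCode(bits);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

When writing to a real file, wrapping the FileOutputStream in a BufferedOutputStream before handing it to the writer avoids one system call per byte, which is likely to matter for large inputs.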

When decoding the Huffman codes, you may run into a problem with those filler zero bits in the last byte: they could be decoded as one or more extraneous symbols. To deal with this, the decoder needs to know when to stop. Either send the number of symbols before the Huffman codes, or add a symbol for end-of-stream to the code.
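Here is a rough sketch of the symbol-count option, assuming bits were packed lowest bit first as in the answer above. The class CountedDecoder and the code table passed in are hypothetical. Looking up strings in a map keeps the example short, whereas a real decoder would more likely walk the Huffman tree one bit at a time.

```java
import java.util.Map;

final class CountedDecoder {
    /**
     * Decodes at most symbolCount symbols from packed bytes.
     * codeToSymbol maps each prefix-free code (as a '0'/'1' string) to its symbol.
     */
    static String decode(byte[] packed, int symbolCount, Map<String, Character> codeToSymbol) {
        StringBuilder out = new StringBuilder();
        StringBuilder code = new StringBuilder();
        int totalBits = packed.length * 8;
        int bitIndex = 0;
        // Stop as soon as the expected number of symbols is reached,
        // so the zero padding in the last byte is never interpreted.
        while (out.length() < symbolCount && bitIndex < totalBits) {
            int bit = (packed[bitIndex >> 3] >> (bitIndex & 7)) & 1;
            bitIndex++;
            code.append(bit == 1 ? '1' : '0');
            Character symbol = codeToSymbol.get(code.toString());
            if (symbol != null) {
                out.append(symbol);
                code.setLength(0);
            }
        }
        return out.toString();
    }
}
```

For example, with the made-up code a = 0, b = 10, c = 11, the message abc packs into the single byte 0x1A with three padding bits. Decoding with a count of 3 yields abc, while asking for 6 symbols would turn the padding into three extra a's, which is exactly the problem described above. The count itself could be written ahead of the codes, for instance with DataOutputStream.writeInt.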
