简体   繁体   中英

How to efficiently decompress huffman coded file

I've found a lot of questions asking this but some of the explanations were very difficult to understand and I couldn't quite grasp the concept of how to efficiently decompress the file. I have found these related questions: Huffman code with lookup table How to decode huffman code quickly?

But I fail to understand the explanation. I know how to encode and decode a huffman tree regularly. Right now in my compression program I can write any of the following information to file symbol huffman code (unsigned long) huffman code length

What I plan to do is get a text file, separate it into small text files and compress each individually and then decompress that file by sending all the small compressed files with their respective lookup table (don't know how to do this part) to a Nvidia GPU to try to decompress the file in parallel using some sort of look up table.

I have 3 questions: What information should I write to file in the header to construct the look up table? How do I recreate this table from file? How do I use it to decode the huffman encoded file quickly?

Don't bother writing it yourself, unless this is a didactic exercise. Use zlib , lz4 , or any of several other free compression/decompression libraries out there that are far better tested than anything you'll be able to do.

You are only talking about Huffman coding, indicating that you would only get a small portion of the available compression. Most of the compression in the libraries mentioned come from matching strings. Look up "LZ77".

As for efficient Huffman decoding, you can look at how zlib's inflate does it. It creates a lookup table for the most-significant nine bits of the code. Each entry in the table has either a symbol and numbers of bits for that code (less than or equal to nine), or if the provided nine bits is a prefix of a longer code, that entry has a pointer to another table to resolve the rest of the code and the number of bits needed for that secondary table. (There are several of these secondary tables.) There are multiple entries for the same symbol if the code length is less than nine. In fact, 2 9-n multiple entries for an n-bit code.

So to decode you get nine bits from the input and get the entry from the table. If it is a symbol, then you remove the number of bits indicated for the code from your stream and emit the symbol. If it is a pointer to a secondary table, then you remove nine bits from the stream, get the number of bits indicated by the table, and look it up there. Now you will definitely get a symbol to emit, and the number of remaining bits to remove from the stream.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM