
Huffman tree for big files

I've been searching the Internet but couldn't find what I need.

I have to compress big files using Huffman coding. My idea was to read the first 1-2 MB of the file and build the Huffman tree from that sample, to avoid reading the whole file once to build the tree and then a second time to encode it (two full passes over the data). If any of the 256 possible byte values was missing from the sample, I'd add it myself, in case it appears later in the file (and not in the first 1-2 MB). But I tried to test the result using this:

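const int size = 256;  // one entry per possible byte value
int * totalFr = new int[size];
unsigned char * symArr = new unsigned char[size];

for (int i = 0; i < size; i++)
{
    totalFr[i] = i;                             // test frequency for byte value i
    symArr[i] = static_cast<unsigned char>(i);
}

// Note: sizeof(symArr) / sizeof(symArr[0]) would give the pointer size, not 256
buildHuffmanTree(totalFr, symArr, size);
delete[] totalFr;
delete[] symArr;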

where buildHuffmanTree is a function that builds the Huffman tree. This made me realise that the best character code I could get was 7 bits, for example 0000001.

And this is where my question came from: is it worth building a Huffman tree for the full 256-symbol alphabet? Or is it better to use adaptive Huffman coding on chunks of 1-2 MB?

You can't expect a whole lot out of just Huffman coding unless the data is extremely biased with respect to which bytes are present. I just tried it on a 100 MB file of English text from Wikipedia. It got the file down to 63% of its original size, so roughly eight bits per byte down to five on average. That was with the Huffman coding done in blocks of about 16 KB at a time, so that the code was adapted to each block.
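The 63% figure corresponds to about 0.63 × 8 ≈ 5 bits per byte. One way to estimate what per-block Huffman coding could achieve on your own data, without building any tree, is to compute the Shannon entropy of each block's byte histogram. The sketch below assumes the data is already in memory; `entropyBitsPerByte` and `estimatedRatio` are hypothetical names, and the 16 KB default just mirrors the block size mentioned above.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy of the byte histogram of data[0..n), in bits per byte.
// A Huffman code built for exactly this block averages at least this many
// bits per byte and at most about one bit more (code table overhead excluded).
double entropyBitsPerByte(const unsigned char* data, std::size_t n)
{
    if (n == 0) return 0.0;
    std::array<std::uint64_t, 256> count{};
    for (std::size_t i = 0; i < n; ++i)
        ++count[data[i]];

    double bits = 0.0;
    for (std::uint64_t c : count) {
        if (c == 0) continue;  // absent byte values contribute nothing
        double p = static_cast<double>(c) / static_cast<double>(n);
        bits -= p * std::log2(p);
    }
    return bits;
}

// Estimated best-case compressed/original size ratio when each block of
// blockSize bytes gets its own Huffman code (an optimistic lower bound).
double estimatedRatio(const std::vector<unsigned char>& data,
                      std::size_t blockSize = 16 * 1024)
{
    if (data.empty() || blockSize == 0) return 1.0;
    double totalBits = 0.0;
    for (std::size_t off = 0; off < data.size(); off += blockSize) {
        std::size_t n = std::min(blockSize, data.size() - off);
        totalBits += entropyBitsPerByte(data.data() + off, n) * static_cast<double>(n);
    }
    return totalBits / (8.0 * static_cast<double>(data.size()));
}
```

If `estimatedRatio` comes out close to 1.0 for a file, the byte values are used almost evenly and Huffman coding alone will not help much, whatever the block size. Real output will be somewhat larger than the estimate, since codes use whole bits and each block has to carry its own code table.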

Normal zlib compression, which also looks for matching strings, gets it down to 35% of the original size. More advanced compressors such as xz, which spend more time and memory looking harder and farther for matching strings and also use a coding step that does a little better than Huffman coding, get it down to 26% of the original size.
