
C# - Huffman coding for a large file takes too long

I am trying to implement Huffman coding in C#. Encoding large files takes too much time: for example, encoding an 11 MiB binary file takes 10 seconds in debug mode, and I did not even bother waiting for my program to finish with a 27 MiB file.

Here is the problematic loop:

            BitArray bits = new BitArray(8);
            byte[] byteToWrite = new byte[1];
            byte bitsSet = 0;

            while ((bytesRead = inputStream.Read(buffer, 0, 4096)) > 0) // Read input in chunks
            {
                for (int i = 0; i < bytesRead; i++)
                {
                    for (int j = 0; j < nodesBitStream[buffer[i]].Count; j++)
                    {
                        if (bitsSet != 8)
                        {
                            bits[bitsSet] = nodesBitStream[buffer[i]][j];
                            bitsSet++;
                        }
                        else
                        {
                            bits.CopyTo(byteToWrite, 0);
                            outputStream.Write(byteToWrite, 0, byteToWrite.Length);
                            bits = new BitArray(8);
                            bitsSet = 0;

                            bits[bitsSet] = nodesBitStream[buffer[i]][j];
                            bitsSet++;
                        }
                    }
                }
            }

nodesBitStream is a Dictionary<byte, List<bool>>. Each List<bool> represents the path from the Huffman tree root to the leaf node containing a specific symbol, which is represented as a byte.

So I am accumulating bits to form a byte, which I write to an encoded file. It is quite obvious that this can take a very long time, but I have not figured out another way yet. Therefore I am asking for advice on how to speed up the process.

Working bit by bit is a lot of extra work. Also, while a Dictionary<byte, TVal> is decent, a plain array is even faster.

The Huffman codes can also be represented as a pair of integers, one for the length (in bits) and the other holding the bits. In this representation, you can process a symbol in a couple of fast operations, for example (not tested):

BinaryWriter w = new BinaryWriter(outStream);
uint buffer = 0;
int bufbits = 0;
for (int i = 0; i < symbols.Length; i++)
{
    int s = symbols[i];
    buffer <<= lengths[s];  // make room for the bits
    bufbits += lengths[s];  // buffer got longer
    buffer |= values[s];    // put in the bits corresponding to the symbol

    while (bufbits >= 8)    // as long as there is at least a byte in the buffer
    {
        bufbits -= 8;       // forget it's there
        w.Write((byte)(buffer >> bufbits)); // and save it
    }
}
if (bufbits != 0)
    w.Write((byte)(buffer << (8 - bufbits)));

Or some variant: for example, you could fill bytes the other way around, or save up bytes in an array and do bigger writes, etc.
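The "save up bytes in an array and do bigger writes" variant could look roughly like the sketch below. It is only an illustration under assumptions: the class name HuffmanPacker, the method Encode and the 4096-byte buffer sizes are made up here, and lengths/values are assumed to be 256-entry tables with every code at most 25 bits long and stored right-aligned.

```csharp
using System;
using System.IO;

static class HuffmanPacker // hypothetical name, not from the original answer
{
    // lengths[s]: code length in bits for symbol s (assumed <= 25).
    // values[s]:  the code bits for symbol s, right-aligned in the low bits.
    public static void Encode(Stream input, Stream output, int[] lengths, uint[] values)
    {
        byte[] inBuf = new byte[4096];
        byte[] outBuf = new byte[4096];
        int outPos = 0;     // number of finished bytes waiting in outBuf
        uint bitBuf = 0;    // pending bits, newest at the low end
        int bitCount = 0;   // number of pending bits (below 8 between symbols)
        int read;

        while ((read = input.Read(inBuf, 0, inBuf.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                int s = inBuf[i];
                bitBuf = (bitBuf << lengths[s]) | values[s]; // append the code
                bitCount += lengths[s];

                while (bitCount >= 8) // move every complete byte to outBuf
                {
                    bitCount -= 8;
                    outBuf[outPos++] = (byte)(bitBuf >> bitCount);
                    if (outPos == outBuf.Length) // one large write instead of many small ones
                    {
                        output.Write(outBuf, 0, outPos);
                        outPos = 0;
                    }
                }
            }
        }

        if (bitCount != 0) // pad the last partial byte with zero bits
            outBuf[outPos++] = (byte)(bitBuf << (8 - bitCount));
        output.Write(outBuf, 0, outPos);
    }
}
```

Like the loop above, this fills each byte starting from the most significant bit, so a decoder has to read the bits in the same order.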

This code requires code lengths to be limited to 25 bits at most, because up to 7 bits can be left over in the 32-bit buffer when the next code is shifted in (7 + 25 = 32). Usually other requirements lower that limit even further. Huge code lengths are not needed to get a good compression ratio.
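To get from the Dictionary<byte, List<bool>> in the question to the length/value pairs used above, a one-time conversion along these lines might work. This is a sketch under assumptions: CodeTable and Build are hypothetical names, and the first bit of each path is assumed to be the one written first, so it ends up as the most significant bit of the code.

```csharp
using System;
using System.Collections.Generic;

static class CodeTable // hypothetical helper, not part of the original answer
{
    // Converts per-symbol bit paths into two arrays indexed directly by byte value.
    public static void Build(Dictionary<byte, List<bool>> nodesBitStream,
                             out int[] lengths, out uint[] values)
    {
        const int MaxBits = 25; // limit for a 32-bit bit buffer, as stated above
        lengths = new int[256];
        values = new uint[256];

        foreach (KeyValuePair<byte, List<bool>> entry in nodesBitStream)
        {
            List<bool> path = entry.Value;
            if (path.Count > MaxBits)
                throw new InvalidOperationException(
                    "Code for symbol " + entry.Key + " is longer than " + MaxBits + " bits.");

            uint code = 0;
            foreach (bool bit in path)
                code = (code << 1) | (bit ? 1u : 0u); // first path bit becomes most significant

            lengths[entry.Key] = path.Count;
            values[entry.Key] = code;
        }
    }
}
```

Symbols that do not occur in the dictionary keep a length of 0. Note that the question's BitArray loop fills each output byte starting from the least significant bit, so switching to this representation changes the bit order of the encoded file.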

I don't really know how the algorithm works, but looking at your code, two things stick out:

  1. You seem to be using a dictionary indexed by a byte. Maybe a simple List<bool>[] is faster, using buffer[i] to index into it. The memory price you would be paying is rather low. With an array you would be exchanging hash lookups for offset calculations, which are faster. You are doing quite a few lookups there.

  2. Why are you instantiating bits on every iteration? Depending on how many iterations you are doing, that can end up putting pressure on the GC. There seems to be no need: you are essentially overwriting every bit and spitting it out every 8 bits, so simply overwrite it, don't new it up; use the same instance over and over.
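Applied to the loop from the question, the two points could be sketched as follows. Treat it as an illustration only: BitArrayEncoder and the codes array are hypothetical names, codes is assumed to hold one List<bool> per byte value that occurs in the input, and the flush of the last partial byte is an addition that the posted loop does not show.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.IO;

static class BitArrayEncoder // hypothetical wrapper around the question's loop
{
    // codes[b] is the bit path for byte value b (array instead of dictionary).
    public static void Encode(Stream inputStream, Stream outputStream, List<bool>[] codes)
    {
        byte[] buffer = new byte[4096];
        BitArray bits = new BitArray(8);  // allocated once and reused
        byte[] byteToWrite = new byte[1];
        int bitsSet = 0;
        int bytesRead;

        while ((bytesRead = inputStream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < bytesRead; i++)
            {
                List<bool> code = codes[buffer[i]]; // one array lookup per input byte
                for (int j = 0; j < code.Count; j++)
                {
                    bits[bitsSet++] = code[j];
                    if (bitsSet == 8) // all 8 bits were overwritten, so no reset is needed
                    {
                        bits.CopyTo(byteToWrite, 0);
                        outputStream.Write(byteToWrite, 0, 1);
                        bitsSet = 0;
                    }
                }
            }
        }

        if (bitsSet > 0) // assumption: pad the final partial byte with zero bits
        {
            for (int k = bitsSet; k < 8; k++)
                bits[k] = false;
            bits.CopyTo(byteToWrite, 0);
            outputStream.Write(byteToWrite, 0, 1);
        }
    }
}
```

This keeps the bit-by-bit structure, so it only removes the repeated lookups and allocations; it does not reduce the per-bit work itself.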
