
GZIP decompression C# OutOfMemory

I have many large gzip files (roughly 10MB to 200MB) that I downloaded from FTP and need to decompress.

So I searched around and found the following solution for gzip decompression:

    static byte[] Decompress(byte[] gzip)
    {
        using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            const int size = 4096;
            byte[] buffer = new byte[size];
            using (MemoryStream memory = new MemoryStream())
            {
                int count = 0;
                do
                {
                    count = stream.Read(buffer, 0, size);
                    if (count > 0)
                    {
                        memory.Write(buffer, 0, count);
                    }
                }
                while (count > 0);
                return memory.ToArray();
            }
        }
    }

It works well for any file below 50MB, but once the input is larger than 50MB I get a System.OutOfMemoryException. The last position and length of the memory stream before the exception is 134217728 (128MB). I don't think it is related to my physical memory; I understand that I can't have an object larger than 2GB since I'm running 32-bit.

I also need to process the data after decompressing the files. I'm not sure whether a memory stream is the best approach here, but I don't really like writing to a file and then reading the file again.

My questions:

  • Why did I get a System.OutOfMemoryException?
  • What is the best way to decompress gzip files and do some text processing afterwards?

The memory allocation strategy of MemoryStream is not friendly to huge amounts of data.

Since MemoryStream's contract is to keep a contiguous array as its underlying storage, it has to reallocate that array repeatedly for a large stream (roughly log2(size_of_stream) times). The side effects of each reallocation are:

  • long copy delays on every reallocation
  • the new array must fit into free address space that is already heavily fragmented by previous allocations
  • the new array ends up on the LOH, which has its own quirks (no compaction, collected only during Gen 2 GC)

As a result, pushing a large (100MB+) stream through a MemoryStream will likely cause an out-of-memory exception on x86 systems. In addition, the most common pattern of returning the data is to call ToArray, as you do, which requires roughly the same amount of additional space as the last array buffer used by the MemoryStream.
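A quick way to observe the doubling described above is a throwaway sketch like this: write 4KB blocks into a MemoryStream and print Capacity whenever the backing array is reallocated.

    MemoryStream ms = new MemoryStream();
    byte[] block = new byte[4096];
    int lastCapacity = 0;
    for (long written = 0; written < 100L * 1024 * 1024; written += block.Length)
    {
        ms.Write(block, 0, block.Length);
        if (ms.Capacity != lastCapacity)
        {
            // Each line printed here represents a brand-new array plus a full copy of the old data.
            Console.WriteLine("Capacity grew to {0:N0} bytes", ms.Capacity);
            lastCapacity = ms.Capacity;
        }
    }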

Approaches to solve this:

  • The cheapest way is to pre-grow the MemoryStream to approximately the size you need (preferably slightly larger). You can pre-compute the required size by reading into a fake stream that does not store anything (a waste of CPU, but you will be able to read it). Also consider returning a stream instead of a byte array, or returning the MemoryStream's buffer together with its length; see the sketch after this list.
  • Another option, if you need the whole stream or byte array, is to use a temporary file stream instead of a MemoryStream to store the large amount of data.
  • A more complicated approach is to implement a stream that chunks the underlying data into smaller (e.g. 64K) blocks, to avoid LOH allocations and copying data when the stream needs to grow.
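A minimal sketch of the first option, assuming the compressed bytes fit in memory as in the original Decompress method and the usual using System.IO; / using System.IO.Compression; directives. The helper name DecompressPreSized is illustrative: a throwaway counting pass determines the decompressed size, then the MemoryStream is created with that capacity so its backing array is allocated exactly once.

    static MemoryStream DecompressPreSized(byte[] gzip)
    {
        const int size = 4096;
        byte[] buffer = new byte[size];

        // Counting pass: read the whole stream but store nothing,
        // just to learn the decompressed length up front.
        long decompressedSize = 0;
        using (GZipStream counter = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            int count;
            while ((count = counter.Read(buffer, 0, size)) > 0)
                decompressedSize += count;
        }

        // Real pass: the MemoryStream starts at the exact capacity needed,
        // so there are no doubling reallocations and no extra LOH copies.
        // (The cast assumes the decompressed data is under 2GB, which it
        // must be anyway in a 32-bit process.)
        MemoryStream memory = new MemoryStream((int)decompressedSize);
        using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            int count;
            while ((count = stream.Read(buffer, 0, size)) > 0)
                memory.Write(buffer, 0, count);
        }

        memory.Position = 0;
        // Return the stream itself; callers that need raw bytes can use
        // memory.GetBuffer() together with memory.Length instead of ToArray()
        // to avoid one more full-size copy.
        return memory;
    }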

You can try a test like the following to get a feel for how much you can write to a MemoryStream before getting an OutOfMemoryException:

        const int bufferSize = 4096;
        byte[] buffer = new byte[bufferSize];

        int fileSize = 1000 * 1024 * 1024;

        int total = 0;

        try
        {
            using (MemoryStream memory = new MemoryStream())
            {
                while (total < fileSize)
                {
                    memory.Write(buffer, 0, bufferSize);
                    total += bufferSize;
                }
            }

            MessageBox.Show("No errors");
        }
        catch (OutOfMemoryException)
        {
            MessageBox.Show("OutOfMemory around size : " + (total / (1024m * 1024m)) + "MB");
        }

You may have to unzip to a temporary physical file first, re-read it in small chunks, and process as you go.
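A sketch of that approach, assuming the gzip file is already on disk, the content is line-oriented text, and the usual System.IO / System.IO.Compression namespaces are imported; ProcessLine is a hypothetical placeholder for your own processing step.

    static void DecompressToTempAndProcess(string gzipPath)
    {
        string tempPath = Path.GetTempFileName();
        try
        {
            // Decompress straight to disk; only a 64K buffer lives in memory.
            using (Stream input = new GZipStream(File.OpenRead(gzipPath), CompressionMode.Decompress))
            using (Stream output = File.Create(tempPath))
            {
                byte[] buffer = new byte[64 * 1024];
                int count;
                while ((count = input.Read(buffer, 0, buffer.Length)) > 0)
                    output.Write(buffer, 0, count);
            }

            // Re-read the decompressed file in small pieces and process as you go.
            using (StreamReader reader = new StreamReader(tempPath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // ProcessLine(line);   // hypothetical text-processing step
                }
            }
        }
        finally
        {
            File.Delete(tempPath);   // clean up the temporary file
        }
    }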

Side point: interestingly, on a Windows XP PC, the above code gives "OutOfMemory around size 256MB" when targeting .NET 2.0, and "OutOfMemory around size 512MB" on .NET 4.

Do you happen to be processing files in multiple threads? That would consume a large amount of your address space. OutOfMemory errors usually aren't related to physical memory, so a MemoryStream can run out far earlier than you'd expect. Check this discussion: http://social.msdn.microsoft.com/Forums/en-AU/csharpgeneral/thread/1af59645-cdef-46a9-9eb1-616661babf90 . If you switched to a 64-bit process, you'd probably be more than OK for the file sizes you're dealing with.
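A trivial check to confirm which address-space ceiling applies to your process (IntPtr.Size works on any .NET version; Environment.Is64BitProcess is also available from .NET 4.0 onwards):

    // IntPtr is 8 bytes in a 64-bit process, 4 bytes in a 32-bit one.
    Console.WriteLine(IntPtr.Size == 8
        ? "64-bit process: plenty of address space for these file sizes"
        : "32-bit process: usable address space is roughly 2GB and easily fragmented");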

In your current situation, though, you could work with memory-mapped files to get around the address-space limits. If you're using .NET 4.0, it provides a native wrapper for the Windows functions: http://msdn.microsoft.com/en-us/library/dd267535.aspx .
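A minimal sketch of that .NET 4.0 API, assuming you know (or can over-estimate) the decompressed size up front; the method name, output path, and capacity parameter are illustrative:

    using System.IO.MemoryMappedFiles;   // .NET 4.0+

    static void DecompressToMemoryMapped(string gzipPath, string outputPath, long capacity)
    {
        // Back the mapping with a real file, so the decompressed data never
        // has to fit into one contiguous managed array.
        using (var mmf = MemoryMappedFile.CreateFromFile(outputPath, FileMode.Create, null, capacity))
        using (var view = mmf.CreateViewStream())
        using (var gzip = new GZipStream(File.OpenRead(gzipPath), CompressionMode.Decompress))
        {
            gzip.CopyTo(view);   // Stream.CopyTo is available from .NET 4.0
        }
    }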

"I understand that I can't have an object larger than 2GB since I'm running 32-bit."

That is incorrect. You can have as much memory as you need. The 32-bit limitation means you only have 4GB of virtual address space (and the OS takes half of it). Virtual address space is not memory. Here is a nice read.

"Why did I get a System.OutOfMemoryException?"

Because the allocator could not find a contiguous block of address space for your object, or allocation happens too fast and the address space clogs up (most likely the first one).

"What is the best way to decompress gzip files and do some text processing afterwards?"

Write a script that downloads the files, then use a tool such as gzip or 7zip to decompress them, and process the result. Depending on the kind of processing, the number of files, and the total size, you will have to save them to disk at some point to avoid this kind of memory problem. Save them after unzipping and process 1MB at a time.
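If the processing can be done incrementally, you can also skip materializing the decompressed data entirely and read the GZipStream in fixed-size chunks. A sketch under that assumption; ProcessChunk is a hypothetical placeholder for your own processing step:

    static void ProcessInChunks(string gzipPath)
    {
        using (GZipStream gzip = new GZipStream(File.OpenRead(gzipPath), CompressionMode.Decompress))
        {
            byte[] chunk = new byte[1024 * 1024];   // 1MB at a time
            int count;
            while ((count = gzip.Read(chunk, 0, chunk.Length)) > 0)
            {
                // ProcessChunk(chunk, count);   // hypothetical processing step
            }
        }
    }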
