
How to split a big file (12 GB) into multiple 1 GB compressed (.gz) archives in C#?

I have one big .bak file, nearly 12 GB, and I need to split it into multiple 2 GB .gz archives in code.

The big problem is that I need to validate these archives later.

You know, like when you split one file with WinRAR into 3 or 4 archives: you just press "unpack" and it unpacks them all into one file, or fails if an archive is missing (say you deleted one).

I need something like this.

public void Compress(DirectoryInfo directorySelected)
{
    const int chunkSize = 1024 * 1024 * 1024; // size of each piece: 1 GB

    foreach (FileInfo fileToCompress in directorySelected.GetFiles())
    {
        // skip hidden files and files that are already compressed
        if ((File.GetAttributes(fileToCompress.FullName) & FileAttributes.Hidden) == FileAttributes.Hidden
            || fileToCompress.Extension == ".gz")
            continue;

        using (FileStream originalFileStream = fileToCompress.OpenRead())
        {
            byte[] buffer = new byte[chunkSize];
            int counter = 0;
            int bytesRead;

            // read the source in chunkSize pieces and gzip each piece
            // into its own numbered archive
            while ((bytesRead = originalFileStream.Read(buffer, 0, buffer.Length)) > 0)
            {
                using (FileStream compressedFileStream = File.Create(fileToCompress.FullName + counter + ".gz"))
                using (GZipStream compressionStream = new GZipStream(compressedFileStream,
                   CompressionMode.Compress))
                {
                    compressionStream.Write(buffer, 0, bytesRead);
                }
                counter++;
            }
        }
    }
}

It works well, but I don't know how to validate the number of archives.

I have 7 archives from my test file. But how do I read them back into one file and validate that the result is complete?

The GZip format doesn't natively support what you want.

Zip does (the feature is called "spanned archives"), but the ZipArchive class in .NET doesn't support it. You'll need a third-party library for that, such as DotNetZip.

But there's a workaround.

Create a class that inherits from the abstract Stream class. To the outside it pretends to be a single stream that can write but not read or seek; internally it writes to multiple pieces, 2 GB each, using the .NET-provided FileStream. Keep track of the total length written in a long field of your class. As soon as the next Write() call would exceed 2 GB, write just enough bytes to reach 2 GB, close and dispose the underlying FileStream, open another file with the next file name, reset the length counter to 0, and write the remaining bytes from the buffer passed to Write(). Repeat until the stream is closed.

Create an instance of your custom stream, pass it to the constructor of GZipStream, and copy the complete 12 GB of source data into the GZipStream.

If you do it right, the output files will be exactly 2 GB in size (except the last one).
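A minimal sketch of such a splitting stream, assuming a simple numbered naming scheme (`<baseName>.0`, `<baseName>.1`, ...); the class name, naming scheme and constructor parameters are my own choices, not anything from the framework:

```csharp
using System;
using System.IO;

// Write-only stream that splits everything written to it into
// fixed-size part files: "<baseName>.0", "<baseName>.1", ...
public sealed class SplitWriteStream : Stream
{
    private readonly string _baseName;
    private readonly long _partSize;
    private FileStream _current;
    private int _partIndex;
    private long _writtenInPart;

    public SplitWriteStream(string baseName, long partSize)
    {
        _baseName = baseName;
        _partSize = partSize;
        OpenNextPart();
    }

    private void OpenNextPart()
    {
        _current?.Dispose();
        _current = File.Create(_baseName + "." + _partIndex++);
        _writtenInPart = 0;
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        while (count > 0)
        {
            if (_writtenInPart == _partSize)
                OpenNextPart(); // current piece is full, start the next one

            // write only as much as still fits into the current piece
            int toWrite = (int)Math.Min(count, _partSize - _writtenInPart);
            _current.Write(buffer, offset, toWrite);
            _writtenInPart += toWrite;
            offset += toWrite;
            count -= toWrite;
        }
    }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() => _current.Flush();
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _current?.Dispose();
        base.Dispose(disposing);
    }
}
```

Usage would then look something like this (file names are illustrative):

    using (var split = new SplitWriteStream("backup.bak.gz", 2L * 1024 * 1024 * 1024))
    using (var gzip = new GZipStream(split, CompressionMode.Compress))
    using (var source = File.OpenRead("backup.bak"))
        source.CopyTo(gzip);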

To read and decompress them, you'll need a similar trick with another custom stream. Write a stream class that concatenates multiple files on the fly, pretending to be a single stream; this time you only need to implement the Read() method. Give that concatenating stream to the framework's GZipStream. If you reorder or destroy some parts, there's a very high (but not 100%) probability that GZipStream will fail to decompress, complaining about CRC checksums.
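The reading side can be sketched the same way (again, the class name is my own; only Read() does real work, everything else is stubbed):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Read-only stream that presents a list of files, in the given order,
// as one continuous stream of bytes.
public sealed class ConcatReadStream : Stream
{
    private readonly Queue<string> _remaining;
    private FileStream _current;

    public ConcatReadStream(IEnumerable<string> orderedPaths)
    {
        _remaining = new Queue<string>(orderedPaths);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        while (true)
        {
            if (_current == null)
            {
                if (_remaining.Count == 0) return 0; // all parts consumed
                _current = File.OpenRead(_remaining.Dequeue());
            }

            int read = _current.Read(buffer, offset, count);
            if (read > 0) return read;

            _current.Dispose(); // current part exhausted, advance to the next
            _current = null;
        }
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }
    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _current?.Dispose();
        base.Dispose(disposing);
    }
}
```

Wrap it in a GZipStream with CompressionMode.Decompress and copy to the output file; a missing or reordered part typically surfaces as an InvalidDataException from GZipStream, which is the validation you asked about.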

P.S. To implement and debug the above two streams, I recommend using a much smaller dataset, e.g. 12 MB of data split into 1 MB compressed pieces. Once you have it working, increase the constant and test with the complete 12 GB.

The technical post webpages of this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions please contact: yoyou2525@163.com.

 