
Read and zip files in parallel

I am trying to create a zip from a list of files in parallel and stream it to the client.

I have working code that iterates over the files sequentially, but I want them to be zipped in parallel instead (multiple files of >100 MB each).

using ZipArchive zipArchive = new(Response.BodyWriter.AsStream(), ZipArchiveMode.Create, leaveOpen: false);

for (int i = 0; i < arrLocalFilesPath.Length; i++) // iterate over files
{
    string strFilePath = arrLocalFilesPath[i]; // list of files path
    string strFileName = Path.GetFileName(strFilePath);

    ZipArchiveEntry zipEntry = zipArchive.CreateEntry(strFileName, CompressionLevel.Optimal);
    using Stream zipStream = zipEntry.Open();

    using FileStream fileStream = System.IO.File.Open(strFilePath, FileMode.Open, FileAccess.Read);
    fileStream.CopyTo(zipStream);
}

return new EmptyResult();

Parallel.For and Parallel.ForEach do not work with ZipArchive.

Since ZipArchive is not thread safe, I am trying to use DotNetZip to accomplish this task.

I looked at the docs, and here's what I have so far using DotNetZip:

using Stream streamResponseBody = Response.BodyWriter.AsStream();

Parallel.For(0, arrLocalFilesPath.Length, i =>
{
    string strFilePath = arrLocalFilesPath[i]; // list of files path
    string strFileName = Path.GetFileName(strFilePath);

    string strCompressedOutputFile = strFilePath + ".compressed";

    byte[] arrBuffer = new byte[8192];
    int n = -1;

    using FileStream input = System.IO.File.OpenRead(strFilePath);
    using FileStream raw = new(strCompressedOutputFile, FileMode.Create, FileAccess.ReadWrite);

    using Stream compressor = new ParallelDeflateOutputStream(raw);
    while ((n = input.Read(arrBuffer, 0, arrBuffer.Length)) != 0)
    {
        compressor.Write(arrBuffer, 0, n);
    }

    input.CopyTo(streamResponseBody);
});

return new EmptyResult();

However, this doesn't zip the files and send them to the client (it only creates local compressed files on the server).

Using MemoryStream or creating a local zip file is out of the question and not what I am looking for.

The server should seamlessly read the bytes of each file, zip them on the fly, and send them to the client in chunks (like my ZipArchive code does), but with the added benefit of reading those files in parallel while creating the zip.

I know that parallelism is usually not optimal for I/O (and sometimes a bit worse), but parallel-zipping multiple big files should be faster in this case.

I also tried to use SharpZipLib, without success.

Using any other library is fine, as long as it reads and streams the files to the client seamlessly without a large memory footprint.

Any help is appreciated.

If these files are on the same drive there won't be any speedup. Parallelization helps with compressing/decompressing the data, but the disk I/O operations cannot be done in parallel.

Assuming the files are not on the same drive and there is a chance to speed up this process...

Are you sure Stream.CopyTo() is thread safe? Either check the docs, use a single thread, or put a lock on it.
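To illustrate the lock suggestion: Stream.CopyTo() itself is not documented as thread safe, and writing to one shared target stream from several threads can interleave the bytes. A minimal sketch (the shared MemoryStream here is a stand-in for the response body stream; the sample byte arrays are made up for the demo):

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class CopyToLockDemo
{
    // copies every source into one shared stream, serializing the writes
    public static long CopyAll(byte[][] sources)
    {
        var shared = new MemoryStream(); // stand-in for Response.BodyWriter.AsStream()
        object gate = new object();

        Parallel.ForEach(sources, src =>
        {
            using var input = new MemoryStream(src);
            lock (gate)               // only one thread writes at a time,
                input.CopyTo(shared); // so each block arrives contiguously
        });

        return shared.Length;
    }

    static void Main()
    {
        // two in-memory "files" of 10,000 bytes each
        byte[][] sources = { new byte[10_000], new byte[10_000] };
        Array.Fill(sources[0], (byte)1);
        Array.Fill(sources[1], (byte)2);

        Console.WriteLine(CopyAll(sources)); // 20000
    }
}
```

Note that the lock serializes the copy itself, so only the file reads (and any per-thread compression done before the lock) actually run in parallel.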

EDIT:

I've checked my old code, where I packed a huge amount of data into a zip file using ZipArchive. I did it in parallel, but there was no I/O read there.

You can use ZipArchive with Parallel.For, but you need to use lock:

//create zip into stream
using (ZipArchive zipArchive = new ZipArchive(zipFS, ZipArchiveMode.Update, false))
{
    //use parallel foreach, but not for the IO read operation!
    Parallel.ForEach(listOfFiles, filename =>
    {
        //prepare memory for the entry
        MemoryStream ms = new MemoryStream();

        /* fill the memory stream here - I did another packing with BZip2OutputStream,
           because the zip was packed without compression, to speed up random decompression */

        //ZipArchive is not thread safe: serialize all access to it behind one shared lock
        lock (zipArchive)
        {
            //create a file entry
            ZipArchiveEntry zipFileEntry = zipArchive.CreateEntry(filename);

            //open stream for writing
            using (Stream zipEntryStream = zipFileEntry.Open())
            {
                ms.Position = 0; // rewind the stream
                //from ICSharpCode.SharpZipLib.Core: copy memory stream data into the zip entry
                StreamUtils.Copy(ms, zipEntryStream, new byte[4096]);
            }
        }
    });
}

Anyway, if you need to read the files first, that is your performance bottleneck. You won't gain a lot (if anything) from a parallel approach here.
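To make the pattern above concrete with System.IO.Compression alone, here is a self-contained sketch: each file is gzip-packed on a worker thread outside the lock, and the already-compressed data is then stored uncompressed in the archive under a shared lock, mirroring the "pack separately, store without compression" idea. The sample files and the per-file buffer are assumptions for the demo (a temp file would work the same way as the buffer):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

class ParallelZipSketch
{
    // Compresses each file on a worker thread; only archive access is serialized.
    // Returns the number of entries written, for easy verification.
    public static int ZipInParallel(string[] files, string zipPath)
    {
        object gate = new object(); // one shared lock guards all ZipArchive access

        using (FileStream zipFs = File.Create(zipPath))
        using (ZipArchive archive = new ZipArchive(zipFs, ZipArchiveMode.Create, leaveOpen: false))
        {
            Parallel.ForEach(files, path =>
            {
                // read + gzip in parallel, outside the lock
                var packed = new MemoryStream();
                using (FileStream input = File.OpenRead(path))
                using (var gz = new GZipStream(packed, CompressionLevel.Optimal, leaveOpen: true))
                    input.CopyTo(gz);
                packed.Position = 0;

                // ZipArchive is not thread safe: entry creation and writing happen
                // under the lock; the entry is stored without compression because
                // the data is already gzip-packed
                lock (gate)
                {
                    ZipArchiveEntry entry = archive.CreateEntry(
                        Path.GetFileName(path) + ".gz", CompressionLevel.NoCompression);
                    using Stream es = entry.Open();
                    packed.CopyTo(es);
                }
            });
        }

        using ZipArchive check = ZipFile.OpenRead(zipPath);
        return check.Entries.Count;
    }

    static void Main()
    {
        // hypothetical sample files in a temp directory
        string dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        string[] files = new string[3];
        for (int i = 0; i < files.Length; i++)
        {
            files[i] = Path.Combine(dir, $"sample{i}.txt");
            File.WriteAllText(files[i], new string((char)('a' + i), 100_000));
        }

        Console.WriteLine(ZipInParallel(files, Path.Combine(dir, "out.zip"))); // 3
    }
}
```

The trade-off is the one stated above: the CPU-heavy gzip step runs in parallel, but if all files sit on one drive, the reads still compete for the same disk and the gain may be small.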
