
How to write super-fast file-streaming code in C#?

I have to split a huge file into many smaller files. Each of the destination files is defined by an offset and a length, given as a number of bytes. I'm using the following code:

private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);

    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}

Considering that I have to call this function about 100,000 times, it is remarkably slow.

  1. Is there a way to connect the Writer directly to the Reader? (That is, without actually loading the contents into a buffer in memory.)

I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:

public static void CopySection(Stream input, string targetFile, int length)
{
    byte[] buffer = new byte[8192];

    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:

public static void CopySection(Stream input, string targetFile,
                               int length, byte[] buffer)
{
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

Note that this also closes the output stream (due to the using statement), which your original code didn't.

The important point is that this will use the operating system's file buffering more efficiently, because you reuse the same input stream instead of reopening the file and seeking from the beginning for every copy.

I think it'll be significantly faster, but obviously you'll need to try it to see...
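For example, the calling loop might look something like this (just a sketch; the section list, the tuple shape, and the file names are assumptions, not from the question):

using System.Collections.Generic;
using System.IO;
using System.Linq;

public static void SplitFile(string srcFile,
    IEnumerable<(long Offset, int Length, string TargetFile)> sections)
{
    // Open the source exactly once and reuse one buffer for every section.
    using (Stream input = File.OpenRead(srcFile))
    {
        byte[] buffer = new byte[8192];
        foreach (var s in sections.OrderBy(x => x.Offset))   // seek forward only
        {
            input.Position = s.Offset;   // cheap seek within the already-open stream
            CopySection(input, s.TargetFile, s.Length, buffer);
        }
    }
}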

This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
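For instance, a minimal sketch of that BufferedStream idea (the helper name and the 64 KB buffer size are my own choices, not from the answer):

using System.IO;

// Wrap the raw input so that many small reads are served from one
// larger in-memory buffer instead of each hitting the disk.
public static Stream OpenBufferedInput(string srcFile)
{
    return new BufferedStream(File.OpenRead(srcFile), 64 * 1024);   // 64 KB is an arbitrary size to tune
}

// ...then pass the returned stream to CopySection exactly as before.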

The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability, as well as a benchmarking program that compares different I/O methods, including BinaryReader and BinaryWriter. See my blog post at:

http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/
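The post has the details; roughly, the P/Invoke declarations involved look something like this (a sketch of the standard kernel32 signatures, not the exact class from the blog):

using System;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

internal static class NativeMethods
{
    // Synchronous ReadFile/WriteFile against a handle, e.g. FileStream.SafeFileHandle.
    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern bool ReadFile(SafeFileHandle hFile, byte[] buffer,
        uint numBytesToRead, out uint numBytesRead, IntPtr overlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern bool WriteFile(SafeFileHandle hFile, byte[] buffer,
        uint numBytesToWrite, out uint numBytesWritten, IntPtr overlapped);
}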

How large is length? You may do better to re-use a fixed-size (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.

(edit) something like:

private static void copy(string srcFile, string dstFile, int offset,
     int length, byte[] buffer)
{
    using(Stream inStream = File.OpenRead(srcFile))
    using (Stream outStream = File.OpenWrite(dstFile))
    {
        inStream.Seek(offset, SeekOrigin.Begin);
        int bufferLength = buffer.Length, bytesRead;
        // While more than a full buffer's worth remains, copy in buffer-sized chunks...
        while (length > bufferLength &&
            (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
        // ...then copy whatever is left, asking for at most "length" bytes.
        while (length > 0 &&
            (bytesRead = inStream.Read(buffer, 0, length)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }        
}

You shouldn't re-open the source file each time you do a copy; better to open it once and pass the resulting BinaryReader to the copy function. Also, it might help if you order your seeks, so you don't make big jumps inside the file.

If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near each other and reading the whole block you need for them. For example:

offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000

can be grouped into one read:

offset = 1234, length = 1116

Then you only have to "seek" within your buffer and can write the three new files from there without having to read again. (Note that the grouped length has to cover up to the end of the last chunk, i.e. 1350 + 1000 - 1234 = 1116 bytes, including the small gaps between the sections.)
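A sketch of that grouped-read idea (the file names and helper methods are made up for illustration; the numbers are the ones from the example above):

using System;
using System.IO;

static class GroupedCopy
{
    public static void Demo(string srcFile)
    {
        // One read covering 1234..2350, then three writes carved out of the block.
        byte[] block = ReadBlock(srcFile, 1234, 1116);

        WriteSlice("part1.bin", block, 1234 - 1234, 34);
        WriteSlice("part2.bin", block, 1300 - 1234, 40);
        WriteSlice("part3.bin", block, 1350 - 1234, 1000);
    }

    static byte[] ReadBlock(string path, long offset, int length)
    {
        byte[] block = new byte[length];
        using (Stream input = File.OpenRead(path))
        {
            input.Position = offset;
            int read = 0;
            while (read < length)   // loop because Read may return fewer bytes
            {
                int n = input.Read(block, read, length - read);
                if (n == 0) throw new EndOfStreamException();
                read += n;
            }
        }
        return block;
    }

    static void WriteSlice(string path, byte[] block, int offset, int length)
    {
        using (Stream output = File.OpenWrite(path))
        {
            output.Write(block, offset, length);
        }
    }
}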

Have you considered using the CCR? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes it very easy to do this.

static void Main(string[] args)
{
    Dispatcher dp = new Dispatcher();
    DispatcherQueue dq = new DispatcherQueue("DQ", dp);

    Port<long> offsetPort = new Port<long>();

    // Every offset posted to the port is handled by Split on a CCR thread.
    Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
        new Handler<long>(Split)));

    FileStream fs = File.Open(file_path, FileMode.Open);
    long size = fs.Length;
    fs.Dispose();

    // file_path and split_size are assumed to be static fields of this class.
    for (long i = 0; i < size; i += split_size)
    {
        offsetPort.Post(i);
    }
}

private static void Split(long offset)
{
    FileStream reader = new FileStream(file_path, FileMode.Open,
        FileAccess.Read);
    reader.Seek(offset, SeekOrigin.Begin);
    long toRead = 0;
    if (offset + split_size <= reader.Length)
        toRead = split_size;
    else
        toRead = reader.Length - offset;

    byte[] buff = new byte[toRead];
    reader.Read(buff, 0, (int)toRead);   // note: Read may return fewer bytes than requested
    reader.Dispose();
    File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
}

This code posts offsets to a CCR port, which causes a thread to be created to execute the code in the Split method. This causes you to open the file multiple times, but gets rid of the need for synchronization. You can make it more memory-efficient, but you'll have to sacrifice speed.

The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?

Over 100,000 accesses (sum the times): How much time is spent allocating the buffer array? How much time is spent opening the file for read (is it the same file every time)? How much time is spent in read and write operations?
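One crude way to get those numbers is to accumulate separate Stopwatch totals across all the calls, for example with an instrumented version of the original copy method (a sketch; the open/read/write split is just my assumption about what is worth timing):

using System;
using System.Diagnostics;
using System.IO;

static class Profiling
{
    static readonly Stopwatch openTime  = new Stopwatch();
    static readonly Stopwatch readTime  = new Stopwatch();
    static readonly Stopwatch writeTime = new Stopwatch();

    // Instrumented copy; call Report() once after all 100,000 iterations.
    public static void Copy(string srcFile, string dstFile, int offset, int length)
    {
        openTime.Start();
        FileStream reader = File.OpenRead(srcFile);
        openTime.Stop();

        using (reader)
        using (FileStream writer = File.OpenWrite(dstFile))
        {
            reader.Seek(offset, SeekOrigin.Begin);
            byte[] buffer = new byte[length];

            readTime.Start();
            int read = reader.Read(buffer, 0, length);
            readTime.Stop();

            writeTime.Start();
            writer.Write(buffer, 0, read);
            writeTime.Stop();
        }
    }

    public static void Report()
    {
        Console.WriteLine("open: {0} ms, read: {1} ms, write: {2} ms",
            openTime.ElapsedMilliseconds, readTime.ElapsedMilliseconds,
            writeTime.ElapsedMilliseconds);
    }
}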

If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a FileStream for writes? (Try it - do you get identical output? Does it save time?)

Using FileStream + StreamWriter I know it's possible to create massive files in little time (less than 1 minute 30 seconds). I generate three files totalling 700+ megabytes from one file using that technique.

Your primary problem with the code you're using is that you are opening a file every time. That is creating file I/O overhead.

If you knew the names of the files you would be generating ahead of time, you could extract the File.OpenWrite into a separate method; it will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.

No one suggests threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files. This way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files (a disk operation) will take WAY longer than splitting up the data. And of course you should verify first that a sequential approach is not adequate.
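A sketch of that idea using the Task Parallel Library rather than hand-rolled threads (the Section type, the degree of parallelism, and the FileShare settings are assumptions; measure first, since parallel reads can actually hurt on a single spinning disk):

using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class Section
{
    public long Offset;
    public int Length;
    public string TargetFile;
}

static class ParallelSplitter
{
    public static void Split(string srcFile, IEnumerable<Section> sections)
    {
        // Each worker opens its own read-only handle, so no locking is needed.
        Parallel.ForEach(sections,
            new ParallelOptions { MaxDegreeOfParallelism = 4 },
            section =>
            {
                byte[] buffer = new byte[section.Length];
                using (var input = new FileStream(srcFile, FileMode.Open,
                    FileAccess.Read, FileShare.Read))
                {
                    input.Seek(section.Offset, SeekOrigin.Begin);
                    int read = 0;
                    while (read < section.Length)   // Read may return fewer bytes
                    {
                        int n = input.Read(buffer, read, section.Length - read);
                        if (n == 0) break;
                        read += n;
                    }
                }
                File.WriteAllBytes(section.TargetFile, buffer);
            });
    }
}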

I have a similar question about writing bit streams. My code (C# => Visual Studio 2019 => .NET Core) loops through a list returned from a repository, then writes streams to file, but it's taking 36 to 48 hours or more to convert the blobs to streams and write them to file. Please see the code below and advise whether there's a faster method of writing the streams. I looked at parallel processing, but it seems faster processors don't benefit from that approach.

private static void CreateEvidence(string bounds, string evidencePath, IJMATRepository jmatRepo)
{
    IList<Evidence> jmat = jmatRepo.GetJMATs(bounds).ToList();
    foreach (Evidence evidence in jmat)
    {
        // Renamed so the inner list doesn't collide with the foreach variable.
        IList<Evidence> jmatEvidence = jmatRepo.GetEvidence(evidence.ID, "99").ToList();
        if (jmatEvidence?.Any() ?? false)
        {
            string sPath = Utility.RemoveInvalidChars(Path.Combine(evidencePath, evidence.PlatformName));
            Directory.CreateDirectory(sPath);
            string dPath = Path.Combine(sPath, "ABCD");
            Directory.CreateDirectory(dPath);
            foreach (Evidence je in jmatEvidence)
            {
                string filePath = Path.Combine(dPath, je.FilePath);
                WriteEvidence(filePath, je.Evidence);
            }
        }
    }
}

private static void WriteEvidence(string evidencePath, byte[] jmatEvidence)
{
    using (FileStream sourceStream = File.Open(evidencePath, FileMode.OpenOrCreate))
    {
        sourceStream.Seek(0, SeekOrigin.End);
        sourceStream.Write(jmatEvidence, 0, jmatEvidence.Length);
    }
}

(For future reference.)

Quite possibly the fastest way to do this would be to use memory-mapped files (so you are primarily copying memory, with the OS handling the file reads/writes via its paging/memory management).

Memory-mapped files are supported in managed code in .NET 4.0.

But as noted, you need to profile, and expect to switch to native code for maximum performance.
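A minimal sketch of the memory-mapped approach on .NET 4.0+ (the section boundaries and file names are illustrative):

using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedSplitter
{
    // Copy one section of an already-mapped source file into its own file.
    public static void CopySection(MemoryMappedFile map, string targetFile,
                                   long offset, long length)
    {
        using (MemoryMappedViewStream view = map.CreateViewStream(offset, length,
                   MemoryMappedFileAccess.Read))
        using (FileStream output = File.Create(targetFile))
        {
            view.CopyTo(output);   // Stream.CopyTo is also new in .NET 4.0
        }
    }
}

// Usage: map the source once, then copy each section.
// using (var map = MemoryMappedFile.CreateFromFile(srcFile, FileMode.Open))
// {
//     MappedSplitter.CopySection(map, "part1.bin", 1234, 34);
// }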
