
Read very large files in parallel in C#

I have more than 20 files, each containing almost 1 million lines (5 GB). I need to speed up the reading process, so I'm trying to read those files in parallel, but that takes longer than reading them sequentially. Is there any way to read very large files in parallel?

 Parallel.ForEach(sourceFilesList, filePath =>
 {
     if (!string.IsNullOrEmpty(filePath) && File.Exists(filePath))
     {
          StreamReader str = new StreamReader(filePath);
          while (!str.EndOfStream)
          {
              var temporaryObj = new object();
              string line = str.ReadLine();
              // process line here 
          }
     }
});

It's better to use a buffered reader for huge files. Something like this will help:

using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, 
FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // process line here
    }
}

Why BufferedStream is faster

A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance. A buffer can be used for either reading or writing, but never both simultaneously. The Read and Write methods of BufferedStream automatically maintain the buffer.
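The buffer size is the part of this idea that can be tuned. As a minimal sketch, the buffer can be set explicitly on the FileStream and the StreamReader. The class name BufferedLineReader, the method CountLines, the 1 MB default and the UTF-8 encoding are assumptions for illustration, not something from the answer above:

```csharp
using System;
using System.IO;
using System.Text;

public static class BufferedLineReader
{
    // Hypothetical helper: counts lines while reading through an explicitly sized buffer.
    public static long CountLines(string path, int bufferSize = 1 << 20)
    {
        // FileOptions.SequentialScan hints to the OS that the file is read front to back.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, bufferSize, FileOptions.SequentialScan))
        // Assumption: the file is UTF-8 (a byte order mark, if present, overrides this).
        using (var sr = new StreamReader(fs, Encoding.UTF8, true, bufferSize))
        {
            long count = 0;
            while (sr.ReadLine() != null)
            {
                count++; // process the line here instead of just counting
            }
            return count;
        }
    }
}
```

Whether a larger buffer actually helps depends on the disk and the file system, so it is worth timing a few sizes on the real files.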

This is an I/O-bound operation, so my suggestion is to use async/await as shown below (mainly the ReadAsync method, which performs the read asynchronously). With async/await, threads are free to do other work while a read is pending, so your machine's physical cores are used more efficiently.

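using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Must be async and return Task (not void) so that await can be used inside it.
public async Task ReadFiles()
{
  List<string> paths = new List<string>(){"path1", "path2", "path3"};
  foreach(string path in paths)
  {
      await ProcessRead(path);
  }
}

// Returns Task instead of void so the caller can await it.
public async Task ProcessRead(string filePath)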
{
    if (File.Exists(filePath) == false)
    {
        Debug.WriteLine("file not found: " + filePath);
    }
    else
    {
        try
        {
            string text = await ReadTextAsync(filePath);
            Debug.WriteLine(text);
        }
        catch (Exception ex)
        {
            Debug.WriteLine(ex.Message);
        }
    }
}

private async Task<string> ReadTextAsync(string filePath)
{
    using (FileStream sourceStream = new FileStream(filePath,
        FileMode.Open, FileAccess.Read, FileShare.Read,
        bufferSize: 4096, useAsync: true))
    {
        StringBuilder sb = new StringBuilder();

        byte[] buffer = new byte[0x1000];
        int numRead;
        while ((numRead = await sourceStream.ReadAsync(buffer, 0, buffer.Length)) != 0)
        {
            // Encoding.Unicode means UTF-16; use the encoding your files were written in.
            string text = Encoding.Unicode.GetString(buffer, 0, numRead);
            sb.Append(text);
        }

        return sb.ToString();
    }
}

The code is taken from MSDN: Using Async for File Access (C# and Visual Basic)
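The sample above builds one string per file, which means holding the whole file in memory, and it reads the files one after another. A rough sketch of a line-by-line variant that reads several files concurrently could look like the following. The names ConcurrentFileReader, CountLinesAsync and CountAllAsync, the UTF-8 encoding, the 64 KB buffer and the limit of 4 concurrent files are all assumptions for illustration, not part of the MSDN sample:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrentFileReader
{
    // Reads one file line by line without blocking a thread while waiting on the disk.
    public static async Task<long> CountLinesAsync(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read,
                                       bufferSize: 64 * 1024, useAsync: true))
        using (var sr = new StreamReader(fs, Encoding.UTF8)) // assumption: UTF-8 files
        {
            long count = 0;
            while (await sr.ReadLineAsync() != null)
            {
                count++; // process the line here
            }
            return count;
        }
    }

    // Starts a task per file, but lets at most maxConcurrent files be read at once.
    // Results come back in the same order as the input paths.
    public static async Task<long[]> CountAllAsync(IEnumerable<string> paths, int maxConcurrent = 4)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = paths.Select(async path =>
            {
                await gate.WaitAsync();
                try { return await CountLinesAsync(path); }
                finally { gate.Release(); }
            }).ToList();
            return await Task.WhenAll(tasks);
        }
    }
}
```

The concurrency limit is the knob to experiment with. If all files sit on one spinning disk, reading many of them at once can be slower than reading them in sequence because the drive has to seek between files, which may be what the question is observing.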
