
Read a very large file by chunks and not line-by-line

I want to read a CSV file which can be hundreds of GB or even TB in size. I have a constraint that I can only read the file in 32 MB chunks. My solution to the problem not only runs somewhat slowly, but it can also break a line in the middle.

I wanted to ask if you know of a better solution:

const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0) //reading only 32mb chunks at a time
    {
        //wrap only the bytes actually read; otherwise the last chunk
        //would replay stale data left over from the previous iteration
        using (var stream = new StreamReader(new MemoryStream(buffer, 0, bytesRead)))
        {
            while ((line = stream.ReadLine()) != null)
            {
                //process line
            }
        }
    }
}

Please do not respond with a solution which reads the file line by line (for example, File.ReadLines is NOT an acceptable solution). Why? Because I'm searching for a different approach...

The problem with your solution is that you recreate the streams in each iteration. Try this version:

const int MAX_BUFFER = 33554432; //32MB
byte[] buffer = new byte[MAX_BUFFER];
int bytesRead;
StringBuilder currentLine = new StringBuilder();

using (FileStream fs = File.Open(filePath, FileMode.Open, FileAccess.Read))
using (BufferedStream bs = new BufferedStream(fs))
{
    string line;
    var memoryStream = new MemoryStream(buffer);
    var stream = new StreamReader(memoryStream);
    while ((bytesRead = bs.Read(buffer, 0, MAX_BUFFER)) != 0)
    {
        //limit the stream to the bytes actually read, then rewind it
        memoryStream.SetLength(bytesRead);
        memoryStream.Seek(0, SeekOrigin.Begin);
        stream.DiscardBufferedData(); //the reader caches data, so reset it too

        while (!stream.EndOfStream)
        {
            line = ReadLineWithAccumulation(stream, currentLine);

            if (line != null)
            {
                //process line
            }
        }
    }
}

private char[] charBuffer = new char[1];

private string ReadLineWithAccumulation(StreamReader stream, StringBuilder currentLine)
{
    while (stream.Read(charBuffer, 0, 1) > 0)
    {
        if (charBuffer[0] == '\n')
        {
            string result = currentLine.ToString();
            currentLine.Clear();

            //remove a trailing '\r' if newlines are the two-character "\r\n"
            if (result.Length > 0 && result[result.Length - 1] == '\r')
            {
                result = result.Substring(0, result.Length - 1);
            }

            return result;
        }
        else
        {
            currentLine.Append(charBuffer[0]);
        }
    }

    return null;  //line not complete yet
}

NOTE: This needs some tweaking if newlines are two characters long and you need the newline characters to be contained in the result. The worst case is the newline pair "\r\n" being split across two blocks. However, since you were using ReadLine, I assumed that you don't need this.

Also, note that in case your whole data contains only one line, this will still end up accumulating the whole data in memory anyway.
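An alternative to the per-character loop is to decode each chunk at once and carry the trailing partial line over to the next chunk. This is only a sketch, not the answer's method: it assumes a single-byte encoding such as ASCII (a multi-byte encoding like UTF-8 would need a stateful System.Text.Decoder so characters split across chunk boundaries are handled), and the name `ChunkedLineReader` is illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class ChunkedLineReader
{
    // Reads 'input' in fixed-size chunks and yields complete lines.
    // The trailing partial line of each chunk is kept in 'carry' and
    // completed by the next chunk, so lines are never broken.
    public static IEnumerable<string> ReadLines(Stream input, int chunkSize)
    {
        byte[] buffer = new byte[chunkSize];
        var carry = new StringBuilder(); // partial line from the previous chunk
        int bytesRead;

        while ((bytesRead = input.Read(buffer, 0, chunkSize)) > 0)
        {
            // Decode only the bytes actually read in this chunk.
            string text = Encoding.ASCII.GetString(buffer, 0, bytesRead);
            int start = 0, nl;
            while ((nl = text.IndexOf('\n', start)) >= 0)
            {
                carry.Append(text, start, nl - start);
                string line = carry.ToString().TrimEnd('\r'); // tolerate CRLF
                carry.Clear();
                yield return line;
                start = nl + 1;
            }
            carry.Append(text, start, text.Length - start); // keep the tail
        }

        if (carry.Length > 0)
            yield return carry.ToString(); // last line without a trailing newline
    }
}
```

Because the carry buffer holds at most one line, this has the same caveat as above: a file that is one enormous line still ends up in memory.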

which can be at a size of hundreds of GBs and even TB

For large-file processing, the most suitable class to recommend is the MemoryMappedFile class.

Some advantages:

  • It is ideal for accessing a data file on disk without performing file I/O operations and without buffering the file's contents. This works great when you deal with large data files.

  • You can use memory-mapped files to allow multiple processes running on the same machine to share data with each other.

So try it, and you will notice the difference, since swapping between memory and the hard disk is a time-consuming operation.
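A minimal sketch of walking a large file through fixed-size memory-mapped views follows. The helper name `MappedReader`, the 32 MB window size, and concatenating the windows into one string are all illustrative (real code would process each window in place); it also assumes single-byte text, since each window gets its own StreamReader:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

static class MappedReader
{
    // Reads the whole file through fixed-size memory-mapped views.
    // Returns the concatenated contents purely for demonstration.
    public static string ReadAll(string path, long viewSize)
    {
        long fileLength = new FileInfo(path).Length;
        var result = new StringBuilder();

        using (var mmf = MemoryMappedFile.CreateFromFile(
                   path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            for (long offset = 0; offset < fileLength; offset += viewSize)
            {
                // The last view may be shorter than viewSize.
                long size = Math.Min(viewSize, fileLength - offset);
                using (var view = mmf.CreateViewStream(
                           offset, size, MemoryMappedFileAccess.Read))
                using (var reader = new StreamReader(view))
                {
                    // Process this window; line boundaries still have to be
                    // handled across windows, just as with Stream.Read chunks.
                    result.Append(reader.ReadToEnd());
                }
            }
        }
        return result.ToString();
    }
}
```

Note that memory mapping changes how the bytes reach you, not the line-splitting problem itself: a row that straddles two views still needs the carry-over handling discussed above.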

