
How to read a large (1 GB) txt file in .NET?

I have a 1 GB text file which I need to read line by line. What is the best and fastest way to do this?

private void ReadTxtFile()
{
    string filePath = openFileDialog1.FileName;
    if (!string.IsNullOrEmpty(filePath))
    {
        using (StreamReader sr = new StreamReader(filePath))
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                FormatData(line);
            }
        }
    }
}

In FormatData() I check whether the line starts with a particular word and, if it does, increment an integer variable.

void FormatData(string line)
{
    if (line.StartsWith(word))
    {
        globalIntVariable++;
    }
}

If you are using .NET 4.0, try MemoryMappedFile, which is the class designed for this scenario.

You can use StreamReader.ReadLine otherwise.

Using StreamReader is probably the way to go, since you don't want the whole file in memory at once. MemoryMappedFile is more suited to random access than to sequential reading (plain sequential stream reading is about ten times as fast as memory mapping, while memory mapping is about ten times as fast for random access).

You might also try creating your StreamReader from a FileStream with FileOptions set to SequentialScan (see the FileOptions enumeration), but I doubt it will make much of a difference.
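For illustration, a minimal sketch of what that might look like, reusing filePath and FormatData from the question (the 64 KB buffer size is an arbitrary choice):

using System.IO;

// Open the file with a hint that it will be scanned sequentially,
// then wrap the FileStream in a StreamReader as usual.
using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read,
                               FileShare.Read, 64 * 1024, FileOptions.SequentialScan))
using (var sr = new StreamReader(fs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        FormatData(line);
    }
}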

There are, however, ways to make your example more efficient, since you currently do your formatting in the same loop as the reading. You're wasting clock cycles, so if you want even more performance, a multithreaded asynchronous solution would be better, where one thread reads data and another formats it as it becomes available. Check out BlockingCollection, which might fit your needs (a sketch follows the link below):

Blocking Collection and the Producer-Consumer Problem
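A minimal sketch of that producer/consumer split, assuming the filePath and FormatData from the question (the 10,000-line bounded capacity is an arbitrary choice):

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

// One task reads lines, the other formats them as they become available.
// The bounded capacity keeps the reader from running too far ahead of the formatter.
var lines = new BlockingCollection<string>(boundedCapacity: 10000);

var reader = Task.Factory.StartNew(() =>
{
    foreach (var line in File.ReadLines(filePath))
        lines.Add(line);
    lines.CompleteAdding();   // tell the consumer no more lines are coming
});

var formatter = Task.Factory.StartNew(() =>
{
    foreach (var line in lines.GetConsumingEnumerable())
        FormatData(line);     // same FormatData as in the question
});

Task.WaitAll(reader, formatter);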

If you want the fastest possible performance, in my experience the only way is to read in as large a chunk of binary data as possible sequentially and deserialize it into text in parallel, but the code starts to get complicated at that point.

You can use LINQ:

int result = File.ReadLines(filePath).Count(line => line.StartsWith(word));

File.ReadLines returns an IEnumerable<string> that lazily reads each line from the file without loading the whole file into memory.

Enumerable.Count counts the lines that start with the word.

If you are calling this from a UI thread, use a BackgroundWorker.
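A rough sketch of that, assuming the filePath, word and globalIntVariable from the question:

using System.ComponentModel;
using System.IO;
using System.Linq;

// Run the count on a worker thread so the UI stays responsive.
var worker = new BackgroundWorker();

worker.DoWork += (sender, e) =>
{
    e.Result = File.ReadLines(filePath).Count(line => line.StartsWith(word));
};

worker.RunWorkerCompleted += (sender, e) =>
{
    // Back on the UI thread: safe to update controls or fields here.
    globalIntVariable = (int)e.Result;
};

worker.RunWorkerAsync();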

Probably best to read it line by line.

You should not try to force the whole file into memory by reading to the end and then processing.

StreamReader.ReadLine should work fine. Let the framework choose the buffering, unless profiling shows you can do better.

I was facing the same problem on our production server at Agenty, where we see large files (sometimes 10-25 GB tab-delimited (\t) txt files). After a lot of testing and research, I found the best way is to read the large file in small chunks with a for/foreach loop, setting offset and limit logic with File.ReadLines().

int TotalRows = File.ReadLines(Path).Count(); // Count the number of rows in the file (lazily, without loading it all)
int Limit = 100000; // 100000 rows per batch
for (int Offset = 0; Offset < TotalRows; Offset += Limit)
{
    // FileToTable / TableToFile are extension methods from the Agenty FileReader library linked below
    var table = Path.FileToTable(heading: true, delimiter: '\t', offset: Offset, limit: Limit);

    // Do all your processing here with limit and offset, then save to disk in append mode.
    // Append mode writes the output of each processed batch to the same file.
    table.TableToFile(@"C:\output.txt");
}

See the complete code in my GitHub library: https://github.com/Agenty/FileReader/

Full disclosure - I work for Agenty, the company that owns this library and website.

My file is over 13 GB:


You can use my class:

using System.IO.MemoryMappedFiles;
using System.Text;

public static void Read(int length)
{
    StringBuilder resultAsString = new StringBuilder();

    using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(@"D:\_Profession\Projects\Parto\HotelDataManagement\_Document\Expedia_Rapid.jsonl\Expedia_Rapi.json"))
    using (MemoryMappedViewStream memoryMappedViewStream = memoryMappedFile.CreateViewStream(0, length))
    {
        for (int i = 0; i < length; i++)
        {
            // Reads a byte from the stream and advances the position by one byte; returns -1 at the end of the stream.
            int result = memoryMappedViewStream.ReadByte();

            if (result == -1)
            {
                break;
            }

            char letter = (char)result;
            resultAsString.Append(letter);
        }
    }
}

This code reads the file's text from the start up to the length you pass to the Read(int length) method and fills the resultAsString variable.

It will return the text below:

I'd read the file 10,000 bytes at a time. Then I'd analyse those 10,000 bytes, chop them into lines and feed them to the FormatData function.

Bonus points for splitting the reading and the line analysis across multiple threads.

I'd definitely use a StringBuilder to collect all the strings, and might build a string buffer to keep about 100 strings in memory at all times.
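A hedged sketch of that idea (single-threaded for brevity): read 10,000 bytes at a time, split each chunk into lines, and carry the trailing partial line over to the next chunk. It assumes ASCII-compatible text so a chunk boundary never splits a character; filePath and FormatData are the ones from the question.

using System.IO;
using System.Text;

var buffer = new byte[10000];          // 10,000-byte chunks, as suggested above
var carry = new StringBuilder();       // holds the partial line left over from the previous chunk

using (var fs = File.OpenRead(filePath))
{
    int bytesRead;
    while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
    {
        carry.Append(Encoding.UTF8.GetString(buffer, 0, bytesRead));
        string chunk = carry.ToString();

        // Feed every complete line in this chunk to FormatData.
        int start = 0;
        int newline;
        while ((newline = chunk.IndexOf('\n', start)) >= 0)
        {
            FormatData(chunk.Substring(start, newline - start).TrimEnd('\r'));
            start = newline + 1;
        }

        // Keep the leftover partial line for the next chunk.
        carry.Clear();
        carry.Append(chunk, start, chunk.Length - start);
    }

    // Last line, if the file doesn't end with a newline.
    if (carry.Length > 0)
        FormatData(carry.ToString());
}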
