How to open a large text file in C#

I have a text file that contains about 100,000 articles. The file is structured like this:

.Document ID 42944-YEAR:5
.Date  03\08\11
.Cat  political
Article Content 1

.Document ID 42945-YEAR:5
.Date  03\08\11
.Cat  political
Article Content 2

I want to open this file in C# and process it line by line. I tried this code:

String[] FileLines = File.ReadAllText(
                  TB_SourceFile.Text).Split(Environment.NewLine.ToCharArray()); 

But it throws:

Exception of type 'System.OutOfMemoryException' was thrown.

The question is: how can I open this file and read it line by line?

  • File Size: 564 MB (591,886,626 bytes)
  • File Encoding: UTF-8
  • File contains Unicode characters.

You can open the file and read it as a stream rather than loading everything into memory all at once.

From MSDN:

using System;
using System.IO;

class Test 
{
    public static void Main() 
    {
        try 
        {
            // Create an instance of StreamReader to read from a file.
            // The using statement also closes the StreamReader.
            using (StreamReader sr = new StreamReader("TestFile.txt")) 
            {
                String line;
                // Read and display lines from the file until the end of 
                // the file is reached.
                while ((line = sr.ReadLine()) != null) 
                {
                    Console.WriteLine(line);
                }
            }
        }
        catch (Exception e) 
        {
            // Let the user know what went wrong.
            Console.WriteLine("The file could not be read:");
            Console.WriteLine(e.Message);
        }
    }
}

Your file is too large to be read into memory in one go, as File.ReadAllText is trying to do. You should instead read the file line by line.

Adapted from MSDN:

string line;
// Read the file and display it line by line.
using (StreamReader file = new StreamReader(@"c:\yourfile.txt"))
{
    while ((line = file.ReadLine()) != null)
    {    
        Console.WriteLine(line);
        // do your processing on each line here
    }
}

This way, only the current line (plus the reader's small internal buffer) has to be held in memory at any one time.
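For the file format shown in the question, the per-line processing could group lines into articles as they stream past. The following is a minimal sketch, not a definitive parser. It assumes every article starts with a `.Document ID` line, that the `.Date` and `.Cat` header lines look like the sample, and that no content line begins with those prefixes. The `Article` class and `ArticleReader.ReadArticles` method are hypothetical names introduced for illustration, and blank lines inside an article are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical container for one record of the question's file format.
class Article
{
    public string Id = "";
    public string Date = "";
    public string Category = "";
    public StringBuilder Content = new StringBuilder();
}

static class ArticleReader
{
    // Streams articles one at a time; only the current article is kept in memory.
    public static IEnumerable<Article> ReadArticles(TextReader reader)
    {
        Article current = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.StartsWith(".Document ID", StringComparison.Ordinal))
            {
                // A new header begins: hand back the previous article, if any.
                if (current != null) yield return current;
                current = new Article { Id = line.Substring(".Document ID".Length).Trim() };
            }
            else if (current == null)
            {
                continue; // ignore anything before the first header
            }
            else if (line.StartsWith(".Date", StringComparison.Ordinal))
            {
                current.Date = line.Substring(".Date".Length).Trim();
            }
            else if (line.StartsWith(".Cat", StringComparison.Ordinal))
            {
                current.Category = line.Substring(".Cat".Length).Trim();
            }
            else if (line.Length > 0)
            {
                current.Content.AppendLine(line);
            }
        }
        if (current != null) yield return current; // last article in the file
    }
}

static class Program
{
    static void Main()
    {
        // In-memory sample for demonstration; for the real file, pass
        // new StreamReader(@"c:\yourfile.txt") inside a using block instead.
        string sample =
            ".Document ID 42944-YEAR:5\n.Date  03\\08\\11\n.Cat  political\nArticle Content 1\n\n" +
            ".Document ID 42945-YEAR:5\n.Date  03\\08\\11\n.Cat  political\nArticle Content 2\n";

        foreach (Article article in ArticleReader.ReadArticles(new StringReader(sample)))
        {
            Console.WriteLine(article.Id + " | " + article.Category);
        }
    }
}
```

Because `ReadArticles` is an iterator, each article becomes eligible for garbage collection once the loop moves on to the next one, provided the caller does not keep a reference to it.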

If you are using .NET Framework 4, there is a static method on System.IO.File called ReadLines that returns an IEnumerable<string>. I believe it was added to the framework for this exact scenario; however, I have yet to use it myself.

MSDN Documentation - File.ReadLines Method (String)

Related Stack Overflow question - Bug in the File.ReadLines(..) method of the .NET Framework 4.0
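A minimal sketch of the File.ReadLines approach, assuming .NET Framework 4 or later. The path is a placeholder and the `CountArticles` helper is a hypothetical name used only for illustration. `File.ReadLines` enumerates lazily, so lines are read as the loop advances rather than all up front.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class ReadLinesExample
{
    // Hypothetical helper: counts the header lines that start an article.
    public static int CountArticles(IEnumerable<string> lines)
    {
        return lines.Count(l => l.StartsWith(".Document ID", StringComparison.Ordinal));
    }

    static void Main(string[] args)
    {
        // Placeholder path; pass the real file as the first argument.
        string path = args.Length > 0 ? args[0] : @"c:\yourfile.txt";

        // File.ReadLines returns a lazy IEnumerable<string>; the file is read
        // as the sequence is enumerated, not loaded all at once.
        IEnumerable<string> lines = File.ReadLines(path, Encoding.UTF8);
        Console.WriteLine(CountArticles(lines));
    }
}
```

By contrast, `File.ReadAllLines` returns a `string[]` and reads the whole file before returning, so it would not avoid the memory problem in the question.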

Something like this:

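using (var fileStream = File.OpenText(@"path to file"))
{
    string fileLine;
    // ReadLine returns null at the end of the file, so an empty file
    // never reaches the loop body with a null line
    while ((fileLine = fileStream.ReadLine()) != null)
    {
        // process fileLine here
    }
}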
