What's the fastest way to read a text file line-by-line?

I want to read a text file line by line. I wanted to know if I'm doing it as efficiently as possible within the .NET C# scope of things.

This is what I'm trying so far:

var filestream = new System.IO.FileStream(textFilePath,
                                          System.IO.FileMode.Open,
                                          System.IO.FileAccess.Read,
                                          System.IO.FileShare.ReadWrite);
var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);

string lineOfText;
while ((lineOfText = file.ReadLine()) != null)
{
    // Do something with lineOfText
}

To find the fastest way to read a file line by line you will have to do some benchmarking. I have done some small tests on my computer, but you cannot expect that my results apply to your environment.

Using StreamReader.ReadLine

This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing this will in general increase performance. The default size is 1,024 and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine an optimal buffer size. A bigger buffer is - if not faster - at least not slower than a smaller buffer.

const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
{
    String line;
    while ((line = streamReader.ReadLine()) != null)
    {
        // Process line
    }
}

The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.
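
A minimal sketch of what passing FileOptions.SequentialScan could look like (the buffer size of 4,096 is only an illustrative value, not a measured optimum):

using (var fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 4096, FileOptions.SequentialScan))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
    String line;
    while ((line = streamReader.ReadLine()) != null)
    {
        // Process line
    }
}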

Using File.ReadLines使用 File.ReadLines

This is very much like your own solution except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this results in slightly better performance compared to your code with the buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented using an iterator block and does not consume memory for all lines.

var lines = File.ReadLines(fileName);
foreach (var line in lines)
{
    // Process line
}

Using File.ReadAllLines

This is very much like the previous method except that this method grows a list of strings used to create the returned array of lines, so the memory requirements are higher. However, it returns String[] and not an IEnumerable<String>, allowing you to randomly access the lines.

var lines = File.ReadAllLines(fileName);
for (var i = 0; i < lines.Length; i += 1) {
  var line = lines[i];
  // Process line
}

Using String.Split

This method is considerably slower, at least on big files (tested on a 511 KB file), probably due to how String.Split is implemented. It also allocates an array for all the lines, increasing the memory required compared to your solution.

using (var streamReader = File.OpenText(fileName))
{
    var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    foreach (var line in lines)
    {
        // Process line
    }
}

My suggestion is to use File.ReadLines because it is clean and efficient. If you require special sharing options (for example you use FileShare.ReadWrite), you can use your own code, but you should increase the buffer size.
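
For example, a sketch of the original code with the sharing options kept but a larger buffer (4,096 here is only an illustrative choice; benchmark for your own workload):

const Int32 BufferSize = 4096;
using (var fileStream = new FileStream(textFilePath, FileMode.Open, FileAccess.Read,
                                       FileShare.ReadWrite))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
{
    String lineOfText;
    while ((lineOfText = streamReader.ReadLine()) != null)
    {
        // Do something with lineOfText
    }
}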

If you're using .NET 4, simply use File.ReadLines which does it all for you. I suspect it's much the same as yours, except it may also use FileOptions.SequentialScan and a larger buffer (128 seems very small).

While File.ReadAllLines() is one of the simplest ways to read a file, it is also one of the slowest.

If you're just wanting to read lines in a file without doing much, according to these benchmarks, the fastest way to read a file is the age-old method of:

using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        // do minimal amount of work here
    }
}

However, if you have to do a lot with each line, then this article concludes that the best way is the following (and it's faster to pre-allocate a string[] if you know how many lines you're going to read):

AllLines = new string[MAX]; // only allocate memory here (MAX = the expected number of lines)

using (StreamReader sr = File.OpenText(fileName))
{
    int x = 0;
    while (!sr.EndOfStream)
    {
        AllLines[x] = sr.ReadLine();
        x += 1;
    }
} // Finished. Close the file

// Now parallel process each line in the file
Parallel.For(0, AllLines.Length, x =>
{
    DoYourStuff(AllLines[x]); // do your work here
});

Use the following code:

foreach (string line in File.ReadAllLines(fileName))

This made a HUGE difference in reading performance.

It comes at the cost of memory consumption, but totally worth it!

If the file size is not big, then it is faster to read the entire file and split it afterwards:

var lines = sr.ReadToEnd().Split(new[] { Environment.NewLine },
                                 StringSplitOptions.RemoveEmptyEntries);

There's a good topic about this in the Stack Overflow question Is 'yield return' slower than "old school" return?

It says:

ReadAllLines loads all of the lines into memory and returns a string[]. All well and good if the file is small. If the file is larger than will fit in memory, you'll run out of memory.

ReadLines, on the other hand, uses yield return to return one line at a time. With it, you can read any size file. It doesn't load the whole file into memory.

Say you wanted to find the first line that contains the word "foo", and then exit. Using ReadAllLines, you'd have to read the entire file into memory, even if "foo" occurs on the first line. With ReadLines, you only read one line. Which one would be faster?
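
A sketch of that "foo" scenario using File.ReadLines; because the lines are produced lazily, the loop can stop at the first match without reading the rest of the file:

string firstMatch = null;
foreach (var line in File.ReadLines(fileName))
{
    if (line.Contains("foo"))
    {
        firstMatch = line;
        break; // stop here; the remaining lines are never read from disk
    }
}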

If you have enough memory, I've found some performance gains by reading the entire file into a memory stream, and then opening a stream reader on that to read the lines. As long as you actually plan on reading the whole file anyway, this can yield some improvements.
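
One way that could look, assuming the file fits comfortably in memory:

var bytes = File.ReadAllBytes(fileName);      // one sequential read from disk
using (var memoryStream = new MemoryStream(bytes))
using (var reader = new StreamReader(memoryStream))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // Process line, served entirely from memory with no further disk I/O
    }
}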

You can't get any faster if you want to use an existing API to read the lines. But reading larger chunks and manually finding each new line in the read buffer would probably be faster.
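
A rough sketch of that chunked approach, assuming '\n' line endings (a complete version would need more care with encodings and edge cases):

const int ChunkSize = 64 * 1024;
var buffer = new char[ChunkSize];
var current = new StringBuilder();

using (var reader = new StreamReader(fileName))
{
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        int start = 0;
        for (int i = 0; i < read; i++)
        {
            if (buffer[i] == '\n')
            {
                current.Append(buffer, start, i - start);
                string line = current.ToString().TrimEnd('\r');
                // Process line
                current.Clear();
                start = i + 1;
            }
        }
        current.Append(buffer, start, read - start); // keep the partial line for the next chunk
    }
    if (current.Length > 0)
    {
        string lastLine = current.ToString().TrimEnd('\r');
        // Process the final line (the file did not end with a newline)
    }
}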

When you need to efficiently read and process a HUGE text file, ReadLines() and ReadAllLines() are likely to throw an Out of Memory exception, as was my case. On the other hand, reading each line separately would take ages. The solution was to read the file in blocks, like below.

The class:

    //can return empty lines sometimes
    class LinePortionTextReader
    {
        private const int BUFFER_SIZE = 100000000; //100M characters
        StreamReader sr = null;
        string remainder = "";

        public LinePortionTextReader(string filePath)
        {
            if (File.Exists(filePath))
            {
                sr = new StreamReader(filePath);
                remainder = "";
            }
        }

        ~LinePortionTextReader()
        {
            if(null != sr) { sr.Close(); }
        }

        public string[] ReadBlock()
        {
            if(null==sr) { return new string[] { }; }
            char[] buffer = new char[BUFFER_SIZE];
            int charactersRead = sr.Read(buffer, 0, BUFFER_SIZE);
            if (charactersRead < 1) { return new string[] { }; }
            bool lastPart = (charactersRead < BUFFER_SIZE);
            if (lastPart)
            {
                char[] buffer2 = buffer.Take<char>(charactersRead).ToArray();
                buffer = buffer2;
            }
            string s = new string(buffer);
            string[] sresult = s.Split(new string[] { "\r\n" }, StringSplitOptions.None);
            sresult[0] = remainder + sresult[0];
            if (!lastPart)
            {
                remainder = sresult[sresult.Length - 1];
                sresult[sresult.Length - 1] = "";
            }
            return sresult;
        }

        public bool EOS
        {
            get
            {
                return (null == sr) ? true: sr.EndOfStream;
            }
        }
    }

Example of use:

    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length < 3)
            {
                Console.WriteLine("multifind.exe <where to search> <what to look for, one value per line> <where to put the result>");
                return;
            }

            if (!File.Exists(args[0]))
            {
                Console.WriteLine("source file not found");
                return;
            }
            if (!File.Exists(args[1]))
            {
                Console.WriteLine("reference file not found");
                return;
            }

            TextWriter tw = new StreamWriter(args[2], false);

            string[] refLines = File.ReadAllLines(args[1]);

            LinePortionTextReader lptr = new LinePortionTextReader(args[0]);
            int blockCounter = 0;
            while (!lptr.EOS)
            {
                string[] srcLines = lptr.ReadBlock();
                for (int i = 0; i < srcLines.Length; i += 1)
                {
                    string theLine = srcLines[i];
                    if (!string.IsNullOrEmpty(theLine)) //can return empty lines sometimes
                    {
                        for (int j = 0; j < refLines.Length; j += 1)
                        {
                            if (theLine.Contains(refLines[j]))
                            {
                                tw.WriteLine(theLine);
                                break;
                            }
                        }
                    }
                }

                blockCounter += 1;
                Console.WriteLine(String.Format("Blocks processed: {0}", blockCounter));
            }
            tw.Close();
        }
    }

I believe the string splitting and array handling can be significantly improved, yet the goal here was to minimize the number of disk reads.
