What is the fastest way to read a text file line by line?

Question

I'm trying to read a text file line by line. I want to know whether I'm doing it as efficiently as possible within .NET C#.

This is what I'm trying so far:

var filestream = new System.IO.FileStream(textFilePath,
                                          System.IO.FileMode.Open,
                                          System.IO.FileAccess.Read,
                                          System.IO.FileShare.ReadWrite);
var file = new System.IO.StreamReader(filestream, System.Text.Encoding.UTF8, true, 128);

string lineOfText;
while ((lineOfText = file.ReadLine()) != null)
{
    //Do something with the lineOfText
}

Answer 1

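To find the fastest way to read a file line by line you will have to do some benchmarking. I ran some small tests on my computer, but you cannot expect my results to apply to your environment.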
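As a rough illustration of the benchmarking this answer recommends, here is a minimal timing sketch. The class name `LineReadBenchmark` and its methods are hypothetical helpers, not part of the original answer, and a single `Stopwatch` pass is only a starting point: the operating system's file cache makes the first run unrepresentative, so repeat each measurement several times.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Text;

static class LineReadBenchmark
{
    // Runs one candidate strategy over the file and returns the elapsed milliseconds.
    // The strategy returns the number of lines it saw so the work cannot be skipped.
    public static long Time(Func<string, int> strategy, string path, out int lineCount)
    {
        var stopwatch = Stopwatch.StartNew();
        lineCount = strategy(path);
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    // Candidate 1: StreamReader.ReadLine with an explicit buffer size.
    public static int CountWithStreamReader(string path, int bufferSize)
    {
        int count = 0;
        using (var stream = File.OpenRead(path))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true, bufferSize))
        {
            while (reader.ReadLine() != null) count++;
        }
        return count;
    }

    // Candidate 2: File.ReadLines.
    public static int CountWithReadLines(string path)
    {
        int count = 0;
        foreach (var line in File.ReadLines(path)) count++;
        return count;
    }
}
```

A call such as `LineReadBenchmark.Time(p => LineReadBenchmark.CountWithStreamReader(p, 4096), path, out n)` times one buffer size; varying the second argument lets you compare several sizes on your own files.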

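Using StreamReader.ReadLine

This is basically your method. For some reason you set the buffer size to the smallest possible value (128). Increasing it will generally improve performance. The default size is 1,024, and other good choices are 512 (the sector size in Windows) or 4,096 (the cluster size in NTFS). You will have to run a benchmark to determine the optimal buffer size. A bigger buffer is, if not faster, at least not slower than a smaller one.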

const Int32 BufferSize = 128;
using (var fileStream = File.OpenRead(fileName))
  using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize)) {
    String line;
    while ((line = streamReader.ReadLine()) != null) {
      // Process line
    }
  }

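The FileStream constructor allows you to specify FileOptions. For example, if you are reading a large file sequentially from beginning to end, you may benefit from FileOptions.SequentialScan. Again, benchmarking is the best thing you can do.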
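A minimal sketch of passing FileOptions.SequentialScan through the FileStream constructor follows. The class name `SequentialReader` and the 4,096 buffer size are illustrative assumptions, not values taken from the answer; whether the hint helps at all depends on the platform and the file, so measure it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class SequentialReader
{
    // Lazily yields the lines of a file opened with the SequentialScan hint.
    // bufferSize = 4096 is an assumed starting point; benchmark other values.
    public static IEnumerable<string> ReadLines(string path, int bufferSize = 4096)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, bufferSize,
                                           FileOptions.SequentialScan))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true, bufferSize))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}
```

Because the method is an iterator, the file is opened only when enumeration starts and is closed when the `foreach` over the result finishes or is abandoned.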

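Using File.ReadLines

This is very much like your own solution, except that it is implemented using a StreamReader with a fixed buffer size of 1,024. On my computer this gives slightly better performance than your code with a buffer size of 128. However, you can get the same performance increase by using a larger buffer size. This method is implemented with an iterator block and does not hold all lines in memory.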

var lines = File.ReadLines(fileName);
foreach (var line in lines) {
  // Process line
}

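Using File.ReadAllLines

This is very much like the previous method, except that it grows a list of strings that is used to create the returned array of lines, so the memory requirements are higher. However, it returns String[] rather than IEnumerable<String>, which lets you access the lines randomly.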

var lines = File.ReadAllLines(fileName);
for (var i = 0; i < lines.Length; i += 1) {
  var line = lines[i];
  // Process line
}

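Using String.Split

This method is considerably slower, at least on big files (tested on a 511 KB file), probably because of how String.Split is implemented. It also allocates an array for all the lines, which increases the memory required compared to your solution.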

using (var streamReader = File.OpenText(fileName)) {
  var lines = streamReader.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
  foreach (var line in lines) {
    // Process line
  }
}

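My suggestion is to use File.ReadLines because it is clean and efficient. If you need special sharing options (for example, you use FileShare.ReadWrite), you can use your own code, but you should increase the buffer size.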

Answer 2

If you're using .NET 4, simply use File.ReadLines. I suspect it's much the same as your code, except that it may also use FileOptions.SequentialScan and a larger buffer (128 seems very small).

Answer 3

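While File.ReadAllLines() is one of the simplest ways to read a file, it is also one of the slowest.

If you just want to read the lines of a file without doing much with them, then according to these benchmarks the fastest way to read a file is the age-old method: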

using (StreamReader sr = File.OpenText(fileName))
{
        string s = String.Empty;
        while ((s = sr.ReadLine()) != null)
        {
               //do minimal amount of work here
        }
}

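However, if you have to do a lot with each line, then this article concludes that the best way is the following (and it's faster to pre-allocate a string[] if you know how many lines you're going to read):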

string[] AllLines = new string[MAX]; //only allocate memory here; MAX must be at least the number of lines

int lineCount = 0;
using (StreamReader sr = File.OpenText(fileName))
{
        while (!sr.EndOfStream)
        {
               AllLines[lineCount] = sr.ReadLine();
               lineCount += 1;
        }
} //Finished. Close the file

//Now parallel process each line that was actually read
Parallel.For(0, lineCount, x =>
{
    DoYourStuff(AllLines[x]); //do your work here
});

Answer 4

Use the following code:

foreach (string line in File.ReadAllLines(fileName))

This makes a huge difference in read performance.

It comes at the cost of memory consumption, but it is totally worth it!

Answer 5

If the file is not big, it is faster to read the entire file and split it afterwards:

var lines = sr.ReadToEnd().Split(new[] { Environment.NewLine },
                              StringSplitOptions.RemoveEmptyEntries);

Answer 6

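There is a good discussion of this in the Stack Overflow question Is "yield return" slower than "old school" return?

It says:

ReadAllLines loads all of the lines into memory and returns a string[]. All well and good if the file is small. If the file is larger than will fit in memory, you'll run out of memory.

ReadLines, on the other hand, uses yield return to return one line at a time. With it, you can read a file of any size. It doesn't load the whole file into memory.

Say you want to find the first line that contains the word "foo", and then exit. Using ReadAllLines, you have to read the entire file into memory, even if "foo" occurs on the first line. With ReadLines, you read only one line. Which one would be faster?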
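The "foo" scenario can be sketched with File.ReadLines and LINQ. The class and method names here are hypothetical; the point is only that enumeration stops at the first match, so the rest of the file is never read.

```csharp
using System.IO;
using System.Linq;

static class FirstMatch
{
    // Returns the first line containing 'word', or null if no line matches.
    // File.ReadLines is lazy, so reading stops as soon as a match is found.
    public static string FindFirstLineContaining(string path, string word)
    {
        return File.ReadLines(path).FirstOrDefault(line => line.Contains(word));
    }
}
```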

Answer 7

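If you have enough memory, I've found some performance gains by reading the entire file into a memory stream and then opening a stream reader on that to read the lines. As long as you actually plan on reading the whole file anyway, this can yield some improvement.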
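One way this idea might look in code is sketched below, assuming the file is UTF-8 and fits comfortably in memory. The class name `InMemoryLineReader` is hypothetical, and the answer does not say how its memory stream was filled; File.ReadAllBytes is simply one straightforward option.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class InMemoryLineReader
{
    // Loads the whole file with a single read, then parses lines from memory.
    public static List<string> ReadLines(string path)
    {
        var lines = new List<string>();
        byte[] bytes = File.ReadAllBytes(path);
        using (var memory = new MemoryStream(bytes, false)) // read-only view over the bytes
        using (var reader = new StreamReader(memory, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```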

Answer 8

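You can't get any faster if you want to use an existing API to read the lines. But reading larger chunks and manually finding each new line in the read buffer would probably be faster.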
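The following is a minimal sketch of that chunked approach, not a tuned implementation. The class name `ChunkedLineScanner` and the 64K-character chunk size are assumptions; it treats "\n" and "\r\n" as line terminators but not a lone "\r", and it copies every line through a StringBuilder, which a faster version would avoid for lines that fit inside one chunk. Whether it actually beats ReadLine needs to be measured.

```csharp
using System;
using System.IO;
using System.Text;

static class ChunkedLineScanner
{
    // Reads the file in large character chunks and calls onLine once per line.
    public static void ForEachLine(string path, Action<string> onLine, int chunkChars = 64 * 1024)
    {
        var buffer = new char[chunkChars];
        var pending = new StringBuilder(); // text of a line that spans chunk boundaries
        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                int start = 0;
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] != '\n') continue;
                    pending.Append(buffer, start, i - start);
                    Emit(pending, onLine);
                    start = i + 1;
                }
                // Keep the unterminated tail for the next chunk.
                pending.Append(buffer, start, read - start);
            }
        }
        if (pending.Length > 0) Emit(pending, onLine); // last line without a trailing newline
    }

    private static void Emit(StringBuilder pending, Action<string> onLine)
    {
        // Drop the '\r' of a "\r\n" pair before handing the line over.
        if (pending.Length > 0 && pending[pending.Length - 1] == '\r') pending.Length--;
        onLine(pending.ToString());
        pending.Clear();
    }
}
```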

Answer 9

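When you need to read and process a huge text file efficiently, ReadLines() and ReadAllLines() are likely to throw an Out of Memory exception; that was my case. On the other hand, reading each line separately took ages. The solution was to read the file in blocks, as shown below.

The class: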

    //can return empty lines sometimes
    class LinePortionTextReader
    {
        private const int BUFFER_SIZE = 100000000; //100M characters
        StreamReader sr = null;
        string remainder = "";

        public LinePortionTextReader(string filePath)
        {
            if (File.Exists(filePath))
            {
                sr = new StreamReader(filePath);
                remainder = "";
            }
        }

        ~LinePortionTextReader()
        {
            if(null != sr) { sr.Close(); }
        }

        public string[] ReadBlock()
        {
            if(null==sr) { return new string[] { }; }
            char[] buffer = new char[BUFFER_SIZE];
            int charactersRead = sr.Read(buffer, 0, BUFFER_SIZE);
            if (charactersRead < 1) { return new string[] { }; }
            bool lastPart = (charactersRead < BUFFER_SIZE);
            if (lastPart)
            {
                char[] buffer2 = buffer.Take<char>(charactersRead).ToArray();
                buffer = buffer2;
            }
            string s = new string(buffer);
            string[] sresult = s.Split(new string[] { "\r\n" }, StringSplitOptions.None);
            sresult[0] = remainder + sresult[0];
            if (!lastPart)
            {
                remainder = sresult[sresult.Length - 1];
                sresult[sresult.Length - 1] = "";
            }
            return sresult;
        }

        public bool EOS
        {
            get
            {
                return (null == sr) ? true: sr.EndOfStream;
            }
        }
    }

Example of use:

    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length < 3)
            {
                Console.WriteLine("multifind.exe <where to search> <what to look for, one value per line> <where to put the result>");
                return;
            }

            if (!File.Exists(args[0]))
            {
                Console.WriteLine("source file not found");
                return;
            }
            if (!File.Exists(args[1]))
            {
                Console.WriteLine("reference file not found");
                return;
            }

            TextWriter tw = new StreamWriter(args[2], false);

            string[] refLines = File.ReadAllLines(args[1]);

            LinePortionTextReader lptr = new LinePortionTextReader(args[0]);
            int blockCounter = 0;
            while (!lptr.EOS)
            {
                string[] srcLines = lptr.ReadBlock();
                for (int i = 0; i < srcLines.Length; i += 1)
                {
                    string theLine = srcLines[i];
                    if (!string.IsNullOrEmpty(theLine)) //can return empty lines sometimes
                    {
                        for (int j = 0; j < refLines.Length; j += 1)
                        {
                            if (theLine.Contains(refLines[j]))
                            {
                                tw.WriteLine(theLine);
                                break;
                            }
                        }
                    }
                }

                blockCounter += 1;
                Console.WriteLine(String.Format("100 Mb blocks processed: {0}", blockCounter));
            }
            tw.Close();
        }
    }

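I believe the string splitting and array handling can be improved significantly, but the goal here was to minimize the number of disk reads.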

9 answers (score, timestamp)

Answer 1: 371, 2011-11-07 15:41:29
Answer 2: 208, 2011-11-07 13:26:40
Answer 3: 41, 2014-07-23 13:12:44
Answer 4: 18, 2013-08-11 02:11:22
Answer 5: 7, 2011-11-07 13:29:39
Answer 6: 6, 2013-08-12 14:04:17
Answer 7: 2, 2011-11-07 13:28:59
Answer 8: 2, 2011-11-07 13:30:45
Answer 9: -1, 2022-05-23 19:27:20