
Best way to read multiple very large files

I need help figuring out the fastest way to read through about 80 files, each with over 500,000 lines, and write them to one master file where each input file's lines form one column. The master file has to open in a plain text editor like Notepad and not a Microsoft Office product, because those can't handle the number of lines.

For example, the master file should look something like this:

File1_Row1,File2_Row1,File3_Row1,...

File1_Row2,File2_Row2,File3_Row2,...

File1_Row3,File2_Row3,File3_Row3,...

etc.

I've tried 2 solutions so far:

  1. Create a jagged array to hold each file's contents, and once all lines in all files have been read, write the master file. The issue with this solution is that Windows throws an error that too much virtual memory is being used.
  2. Dynamically create a reader thread for each of the 80 files that reads a specific line number, and once all threads finish reading a line, combine those values, write them to the file, and repeat for each line in all files. The issue with this solution is that it is very, very slow.

Does anybody have a better solution for reading so many large files in a fast way?

The best way is going to be to open each input file with its own StreamReader, plus a StreamWriter for the output file. Then you loop through the readers, read a single line from each, and write it to the master file. This way you are only loading one line at a time, so there should be minimal memory pressure. I was able to copy 80 files of ~500,000 lines each in 37 seconds. An example:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

class MainClass
{
    static string[] fileNames = Enumerable.Range(1, 80).Select(i => string.Format("file{0}.txt", i)).ToArray();

    public static void Main(string[] args)
    {
        var stopwatch = Stopwatch.StartNew();
        // One reader per input file; each stays open for the whole merge.
        List<StreamReader> readers = fileNames.Select(f => new StreamReader(f)).ToList();

        try
        {
            using (StreamWriter writer = new StreamWriter("master.txt"))
            {
                // Keep going while at least one file still has unread lines.
                while (readers.Any(r => !r.EndOfStream))
                {
                    for (int i = 0; i < readers.Count; i++)
                    {
                        // An exhausted file returns null and contributes an empty column.
                        string line = readers[i].ReadLine();
                        if (line != null)
                        {
                            writer.Write(line);
                        }
                        if (i < readers.Count - 1)
                            writer.Write(",");
                    }
                    writer.WriteLine();
                }
            }
        }
        finally
        {
            // Close the readers even if writing fails.
            foreach (var reader in readers)
            {
                reader.Close();
            }
        }
        Console.WriteLine("Elapsed {0} ms", stopwatch.ElapsedMilliseconds);
    }
}

I've assumed that all the input files have the same number of lines, but the EndOfStream check keeps the loop reading as long as at least one file still has data, so shorter files just produce empty columns.

Memory-mapped files seem like a good fit for you: they avoid putting pressure on your app's memory while maintaining good performance for I/O operations.

The complete documentation is here: Memory-Mapped Files
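
For instance, a minimal sketch of reading one of the input files through a memory-mapped view (the file name file1.txt and the explicit view length are assumptions, not part of the original answer):

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MmfExample
{
    public static void Main()
    {
        long length = new FileInfo("file1.txt").Length;

        // Map the file into the process address space; the OS pages it in on
        // demand, so the file never has to fit in managed memory all at once.
        using (var mmf = MemoryMappedFile.CreateFromFile("file1.txt", FileMode.Open))
        // Request exactly the file's length so the view isn't padded out to a
        // page boundary with trailing zero bytes.
        using (var stream = mmf.CreateViewStream(0, length))
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Console.WriteLine(line); // process one line at a time
            }
        }
    }
}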

If you have enough memory on the computer, I would use the Parallel.Invoke construct and read each file into a pre-allocated array, such as:

string[] file1lines = new string[some value];
string[] file2lines = new string[some value];
string[] file3lines = new string[some value];

Parallel.Invoke(
    () => ReadMyFile(file1, file1lines),
    () => ReadMyFile(file2, file2lines),
    () => ReadMyFile(file3, file3lines)
);

Each ReadMyFile method should just use the following sample code, which, according to these benchmarks, is the fastest way to read a text file:

int x = 0;
using (StreamReader sr = File.OpenText(fileName))
{
    // ReadLine returns null at end of file, ending the loop.
    while ((file1lines[x] = sr.ReadLine()) != null)
    {
        x += 1;
    }
}

If you need to manipulate the data from each file before writing your final output, read this article on the fastest way to do that.

Then you just need one method to write the contents of each string[] to the output as you desire.
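
One possible shape for that last step, sketched under the assumption that each array holds exactly one entry per line (the class and method names here are illustrative, not from the original answer):

using System.IO;
using System.Linq;

static class MasterWriter
{
    // Writes one CSV row per line index, one column per input file's array.
    public static void WriteMaster(string path, params string[][] columns)
    {
        using (var writer = new StreamWriter(path))
        {
            // Stop at the shortest array so we never index past the end.
            int rows = columns.Min(c => c.Length);
            for (int r = 0; r < rows; r++)
            {
                writer.WriteLine(string.Join(",", columns.Select(c => c[r])));
            }
        }
    }
}

It would be called as, for example, MasterWriter.WriteMaster("master.txt", file1lines, file2lines, file3lines);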

Have an array of open file handles. Loop through this array and read a line from each file into a string array. Then combine that array into one line of the master file and append a newline at the end.

This differs from your second approach in that it is single-threaded and doesn't read a specific line number, but always the next one.

Of course, you need to be error-proof in case some files have fewer lines than others.
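
A condensed sketch of that loop, assuming the same file1.txt … file80.txt names as above, with string.Join doing the combining and empty columns standing in for files that run out early:

using System;
using System.IO;
using System.Linq;

class JoinLines
{
    public static void Main()
    {
        string[] paths = Enumerable.Range(1, 80).Select(i => $"file{i}.txt").ToArray();
        StreamReader[] readers = paths.Select(p => new StreamReader(p)).ToArray();

        try
        {
            using (var writer = new StreamWriter("master.txt"))
            {
                string[] fields = new string[readers.Length];
                bool more = true;
                while (more)
                {
                    more = false;
                    for (int i = 0; i < readers.Length; i++)
                    {
                        // A file that has run out contributes an empty column.
                        string line = readers[i].ReadLine();
                        fields[i] = line ?? "";
                        more |= line != null;
                    }
                    if (more)
                        writer.WriteLine(string.Join(",", fields));
                }
            }
        }
        finally
        {
            foreach (var r in readers) r.Dispose();
        }
    }
}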
