
What is the best way to merge large files?

I have to merge thousands of large files (~200 MB each). I would like to know the best way to merge these files. Lines will be conditionally copied to the merged file. Should I use File.AppendAllLines or Stream.CopyTo?

Using File.AppendAllLines:

for (int i = 0; i < countryFiles.Length; i++){
   string srcFileName = countryFiles[i];
   string[] countryExtractLines = File.ReadAllLines(srcFileName);  
   File.AppendAllLines(actualMergedFileName, countryExtractLines);
}

Using Stream.CopyTo:

using (Stream destStream = File.OpenWrite(actualMergedFileName)){
  foreach (string srcFileName in countryFiles){
    using (Stream srcStream = File.OpenRead(srcFileName)){
        srcStream.CopyTo(destStream);
    }
  }
}

sab669's answer is correct: you want to use a StreamReader and loop over each line of the file. However, I would suggest writing each file individually, as otherwise you are going to run out of memory pretty quickly with many 200 MB files.

For example:

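foreach (string f in files) // files: the paths of the source files
{
    // Lines kept from this one file only; the list is discarded before the next file.
    List<String> lines = new List<String>();
    string line;
    int cnt = 0; // number of lines kept from this file
    using (StreamReader reader = new StreamReader(f)) {
        while ((line = reader.ReadLine()) != null) {
            // TODO : Put your conditions in here
            lines.Add(line);
            cnt++;
        }
    }
    // No explicit Close() is needed: the using block disposes the reader.
    // TODO : Append your lines here using StreamWriter
}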
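
If even one file's worth of kept lines is more than you want to hold in memory, the two TODO steps can be folded together so that each line is written as soon as it passes the condition. The following is only a sketch of that idea: the class name LineMerger, the method MergeFiltered, and the keep predicate are hypothetical names chosen for illustration, not part of the question.

```csharp
using System;
using System.IO;

static class LineMerger
{
    // Hypothetical helper: appends every line of each source file that
    // satisfies `keep` to the merged file, one line at a time.
    public static void MergeFiltered(string mergedFileName, string[] sourceFiles, Func<string, bool> keep)
    {
        // append: true preserves whatever the merged file already contains.
        using (var writer = new StreamWriter(mergedFileName, append: true))
        {
            foreach (string path in sourceFiles)
            {
                using (var reader = new StreamReader(path))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        // Only the current line is held in memory.
                        if (keep(line))
                        {
                            writer.WriteLine(line);
                        }
                    }
                }
            }
        }
    }
}
```

A call might look like LineMerger.MergeFiltered(actualMergedFileName, countryFiles, l => l.Contains("Target")), where the condition is a placeholder for your own.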

You can write the files one after the other. For example:

static void MergingFiles(string outputFile, params string[] inputTxtDocs)
{
    using (Stream outputStream = File.OpenWrite(outputFile))
    {
      foreach (string inputFile in inputTxtDocs)
      {
        using (Stream inputStream = File.OpenRead(inputFile))
        {
          inputStream.CopyTo(outputStream);
        }
      }
    }
}

In my view the above code performs very well, because Stream.CopyTo() uses a very simple algorithm. Reflector shows the heart of it as follows:

private void InternalCopyTo(Stream destination, int bufferSize)
{
  int num;
  byte[] buffer = new byte[bufferSize];
  while ((num = this.Read(buffer, 0, buffer.Length)) != 0)
  {
     destination.Write(buffer, 0, num);
  }
}
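
Two details of the byte-copy approach are worth keeping in mind. File.OpenWrite opens an existing file without truncating it, so stale bytes can remain after the new content if the old file was longer; File.Create starts from an empty file. Stream.CopyTo also has an overload that takes the buffer size. The sketch below shows both; the names StreamMerger and Concatenate and the 1 MB default are assumptions made for illustration, not measured recommendations.

```csharp
using System.IO;

static class StreamMerger
{
    // Hypothetical variant of MergingFiles: concatenates the input files
    // byte for byte into outputFile.
    public static void Concatenate(string outputFile, string[] inputFiles, int bufferSize = 1 << 20)
    {
        // File.Create truncates an existing file; File.OpenWrite does not.
        using (Stream output = File.Create(outputFile))
        {
            foreach (string inputFile in inputFiles)
            {
                using (Stream input = File.OpenRead(inputFile))
                {
                    // Same read/write loop as InternalCopyTo, with a caller-chosen buffer.
                    input.CopyTo(output, bufferSize);
                }
            }
        }
    }
}
```

Because this copies raw bytes, it cannot skip individual lines, so it only fits the case where every line of every file is wanted.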

Suppose you have a condition (i.e. a predicate) which must be true for each line in one file that you want to append to another file.

You can process that efficiently as follows:

var filteredLines = 
    File.ReadLines("MySourceFileName")
    .Where(line => line.Contains("Target")); // Put your own condition here.

File.AppendAllLines("MyDestinationFileName", filteredLines);

This approach scales to multiple files and avoids loading the entire file into memory.
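
One way to apply the same idea to many source files is to chain the lazy File.ReadLines sequences with SelectMany, so that only one file is open at a time and lines flow straight through to the destination. This is a sketch under assumptions: LazyMerger, AppendMatchingLines, and the marker parameter are hypothetical names, and the Contains test stands in for your own condition.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LazyMerger
{
    // Hypothetical helper: appends the matching lines of every source file
    // to the destination without materializing any file in memory.
    public static void AppendMatchingLines(string destination, IEnumerable<string> sourceFiles, string marker)
    {
        IEnumerable<string> filteredLines = sourceFiles
            .SelectMany(path => File.ReadLines(path))   // each file is opened only when it is reached
            .Where(line => line.Contains(marker));      // put your own condition here

        // Nothing has been read yet; AppendAllLines drives the enumeration.
        File.AppendAllLines(destination, filteredLines);
    }
}
```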

If, instead of appending all the lines to a file, you wanted to replace the contents, you'd do:

File.WriteAllLines("MyDestinationFileName", filteredLines);

instead of

File.AppendAllLines("MyDestinationFileName", filteredLines);

Also note that there are overloads of these methods that allow you to specify the encoding, if you are not using UTF-8.
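
As a rough illustration of those overloads, the sketch below passes the same Encoding to both File.ReadLines and File.AppendAllLines. The names EncodingAwareCopy and AppendMatchingLines and the marker parameter are hypothetical; which encoding your files actually use is something only you can know.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class EncodingAwareCopy
{
    // Hypothetical helper: reads and writes with an explicitly chosen encoding.
    public static void AppendMatchingLines(string source, string destination, string marker, Encoding encoding)
    {
        IEnumerable<string> filteredLines =
            File.ReadLines(source, encoding)
                .Where(line => line.Contains(marker)); // put your own condition here

        File.AppendAllLines(destination, filteredLines, encoding);
    }
}
```

For UTF-16 files, for instance, the call could pass Encoding.Unicode as the last argument.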

Finally, don't be thrown by the inconsistent method naming. File.ReadLines() does not read all lines into memory, but File.ReadAllLines() does. However, File.WriteAllLines() does NOT buffer all lines into memory, or expect them to all be buffered in memory; it uses IEnumerable<string> for the input.
