
How to write a 1GB file efficiently in C#

I have a .txt file (more than a million rows, around 1GB) and a list of strings. I am trying to remove from the file every row that appears in the list of strings and write the result to a new file, but it is taking a very long time.

using (StreamReader reader = new StreamReader(_inputFileName))
{
    using (StreamWriter writer = new StreamWriter(_outputFileName))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (!_lstLineToRemove.Contains(line))
                writer.WriteLine(line);
        }
    }
}

How can I improve the performance of my code?

You may get some speedup by using PLINQ to do the work in parallel; switching from a list to a hash set will also greatly speed up the Contains() check. HashSet is thread-safe for read-only operations.

private HashSet<string> _hshLineToRemove;

void ProcessFiles()
{
    var inputLines = File.ReadLines(_inputFileName);
    var filteredInputLines = inputLines.AsParallel().AsOrdered().Where(line => !_hshLineToRemove.Contains(line));
    File.WriteAllLines(_outputFileName, filteredInputLines);
}

If the output file does not need to be in the same order as the input file, you can remove the .AsOrdered() call and gain some additional speed.

Beyond this you are really just I/O bound; the only way to make it any faster is to run it on faster drives.

The code is particularly slow because the reader and the writer never execute in parallel: each has to wait for the other.

You can almost double the speed of file operations like this by using a reader thread and a writer thread. Put a BlockingCollection between them so the threads can communicate and so you can limit how many rows you buffer in memory.

If the computation were really expensive (it isn't in your case), a third thread doing the processing, fed through another BlockingCollection, could help too.
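A minimal sketch of the two-thread pipeline described above (the names `Pipeline`, `FilterFile`, and the capacity of 10,000 lines are illustrative assumptions, not part of the original answer). The reader task pushes surviving lines into a bounded BlockingCollection while the writer drains it, so reading and writing overlap:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class Pipeline
{
    public static void FilterFile(string inputPath, string outputPath,
                                  HashSet<string> linesToRemove)
    {
        // Bounded capacity caps how many rows sit in memory at once.
        using var buffer = new BlockingCollection<string>(boundedCapacity: 10_000);

        var reader = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(inputPath))
                if (!linesToRemove.Contains(line))
                    buffer.Add(line);      // blocks when the buffer is full
            buffer.CompleteAdding();       // tell the writer no more lines are coming
        });

        // The writer runs on the calling thread, concurrently with the reader task.
        using (var writer = new StreamWriter(outputPath))
            foreach (var line in buffer.GetConsumingEnumerable())
                writer.WriteLine(line);

        reader.Wait();  // surface any exception thrown by the reader task
    }
}
```

`GetConsumingEnumerable()` ends once `CompleteAdding()` has been called and the buffer is empty, so the writer exits cleanly without any explicit signaling.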

Do not use buffered text routines. Use binary, unbuffered library routines and make your buffer size as big as possible. That's how to make it the fastest.
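A sketch assuming the point here is simply "use large I/O buffers" (the class name, the 1 MiB figure, and sticking with line-based reading are my assumptions; since the task is still filtering lines, a StreamReader over a FileStream with explicit buffer sizes is the closest practical form):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class BigBufferFilter
{
    // 1 MiB buffers (vs. the ~4 KiB default) mean far fewer system calls
    // per gigabyte; tune this for your hardware.
    const int BufferSize = 1 << 20;

    public static void Filter(string inPath, string outPath, HashSet<string> remove)
    {
        // FileOptions.SequentialScan hints the OS to read ahead aggressively.
        using var inStream = new FileStream(inPath, FileMode.Open, FileAccess.Read,
                                            FileShare.Read, BufferSize,
                                            FileOptions.SequentialScan);
        using var reader = new StreamReader(inStream, Encoding.UTF8, true, BufferSize);
        using var outStream = new FileStream(outPath, FileMode.Create, FileAccess.Write,
                                             FileShare.None, BufferSize);
        using var writer = new StreamWriter(outStream, Encoding.UTF8, BufferSize);

        string line;
        while ((line = reader.ReadLine()) != null)
            if (!remove.Contains(line))
                writer.WriteLine(line);
    }
}
```

FileStream, StreamReader, and StreamWriter all take an explicit buffer size in their constructors, so no unmanaged I/O is needed to benefit from bigger buffers.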

Have you considered using AWK?

AWK is a very powerful tool for processing text files; you can find more information about how to filter lines that match certain criteria in Filter text with awk.
