What's the fastest way of appending text from one file to another with huge files?
I have 5 text files of 50GB each. I'd like to combine them all into one text file and then call the LINQ method .Distinct() so that the new file contains only unique entries.
This is how I'm doing it now:
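    foreach (var file in files)
    {
        if (Path.GetExtension(file) == ".txt")
        {
            // Reads the whole 50GB file into memory at once.
            var lines = File.ReadAllLines(file);
            // Only removes duplicates within this one file, and the result is never used.
            var b = lines.Distinct();
            File.AppendAllLines(clear, lines);
        }
    }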
The issue is that the application loads each entire text file into memory, pushing my RAM usage to 100%. This approach might have worked if I had 64GB of RAM, but I only have 16GB. What's the best way to achieve what I'm trying to do? Should I make use of the cores on my CPU? I'm running a 5900X.
If maintaining order is not important, and if the set of possible leading characters is limited (e.g. A-Z), one possibility is to say, "OK, let's start with the As".
Go through each file line by line, looking for lines that start with 'A'. When you find the first one, add it to a new file and to a HashSet. Each time you find another line starting with 'A', check whether it is already in the HashSet, and if it is not, add it to both the new file and the HashSet. Once you've processed all the files, discard the HashSet and move on to the next letter (B).
This way you iterate through the files 26 times, but only the lines for one letter are held in memory at any time.
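A minimal sketch of that per-letter pass might look like the following. The class and method names (FirstLetterDedup, Run) and the alphabet parameter are made up for illustration, and it assumes every line you care about starts with one of the characters in alphabet; empty lines and lines starting with anything else are skipped. It uses File.ReadLines, which streams a file lazily, rather than File.ReadAllLines.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FirstLetterDedup
{
    // Hypothetical helper: writes each distinct line from inputPaths to outputPath,
    // making one full pass over all input files per leading character.
    public static void Run(
        IReadOnlyList<string> inputPaths,
        string outputPath,
        string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    {
        using var writer = new StreamWriter(outputPath);

        foreach (char letter in alphabet)
        {
            // Only the distinct lines for the current letter are kept in memory.
            var seen = new HashSet<string>(StringComparer.Ordinal);

            foreach (string path in inputPaths)
            {
                // File.ReadLines yields one line at a time instead of loading the file.
                foreach (string line in File.ReadLines(path))
                {
                    // HashSet.Add returns false when the line was already present.
                    if (line.Length > 0 && line[0] == letter && seen.Add(line))
                    {
                        writer.WriteLine(line);
                    }
                }
            }
            // 'seen' goes out of scope here and can be garbage collected.
        }
    }
}
```

Note that the output ends up grouped by first letter, so the original line order is not preserved, which matches the assumption stated above.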
Of course, you can optimise this further. Check how much memory is available and divide the possible characters into ranges, so that, for example, on the first iteration your HashSet holds every line starting with A-D.
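As a rough sketch of the range variant, the pass can take a list of character ranges instead of single letters. RangeDedup, Run and the ranges parameter are hypothetical names, and choosing ranges that actually fit in the available memory is left to the caller, since that depends on how the data is distributed. Lines whose first character falls outside every range are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RangeDedup
{
    // Hypothetical helper: one full pass over all input files per character range.
    // Each range is inclusive on both ends, e.g. ('A', 'D').
    public static void Run(
        IReadOnlyList<string> inputPaths,
        string outputPath,
        IReadOnlyList<(char First, char Last)> ranges)
    {
        using var writer = new StreamWriter(outputPath);

        foreach (var (first, last) in ranges)
        {
            // Holds the distinct lines for the current range only.
            var seen = new HashSet<string>(StringComparer.Ordinal);

            foreach (string path in inputPaths)
            {
                foreach (string line in File.ReadLines(path))
                {
                    if (line.Length == 0)
                    {
                        continue;
                    }

                    char c = line[0];
                    if (c >= first && c <= last && seen.Add(line))
                    {
                        writer.WriteLine(line);
                    }
                }
            }
        }
    }
}
```

Calling it with ranges such as ('A', 'D'), ('E', 'M') and ('N', 'Z') would mean three passes over the files instead of 26, at the cost of a larger HashSet per pass.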