
What's the fastest way of appending text from one file to another with huge files?

I have 5 text files of 50GB each. I'd like to combine them into a single text file and then call the LINQ method .Distinct() so that the new file contains only unique entries.

This is how I'm doing it now:

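foreach (var file in files)
{
    if (Path.GetExtension(file) == ".txt")
    {
        // Reads the whole file into memory at once.
        var lines = File.ReadAllLines(file);
        // Removes duplicates within this one file only, not across files.
        var b = lines.Distinct();
        // Append the de-duplicated lines (the original appended `lines`, leaving `b` unused).
        File.AppendAllLines(clear, b);
    }
}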

The problem is that the application loads each entire text file into memory, pushing my RAM usage to 100%. This solution might have worked if I had 64GB of RAM, but I only have 16GB. What's the best way to achieve what I'm trying to accomplish? Should I make use of the cores on my CPU? I'm running a 5900X.

If maintaining order is not important, and if the set of possible leading characters is limited (e.g. A-Z), one possibility is to say, "OK, let's start with the As".

Take each file in turn and go through it line by line, looking for lines that start with 'A'. Each time you find one, check whether it is already in a HashSet; if it is not, add it to both the new file and the HashSet. Once you've processed all the files, discard the HashSet to free its memory and move on to the next letter (B).

You're going to iterate through the files 26 times this way.
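The per-letter passes described above could be sketched roughly as follows. This is a minimal sketch, not a tuned implementation: the class and method names (PrefixDedup, AppendDistinctForLetter, Run) are hypothetical, it assumes every line starts with an uppercase letter A-Z (any other line is silently dropped), and it compares lines case-sensitively.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class PrefixDedup
{
    // Hypothetical helper: writes each distinct line that starts with `first`
    // to `output`. Only the lines for this one letter are held in memory.
    public static void AppendDistinctForLetter(
        IEnumerable<string> inputFiles, char first, TextWriter output)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        foreach (var path in inputFiles)
        {
            // File.ReadLines streams the file lazily, unlike File.ReadAllLines.
            foreach (var line in File.ReadLines(path))
            {
                // HashSet.Add returns false when the line was already seen.
                if (line.Length > 0 && line[0] == first && seen.Add(line))
                    output.WriteLine(line);
            }
        }
        // `seen` becomes unreachable here, so its memory can be reclaimed
        // before the pass for the next letter starts.
    }

    // One full pass over all input files per letter: 26 passes in total.
    public static void Run(IReadOnlyList<string> inputFiles, string outputPath)
    {
        using var output = new StreamWriter(outputPath);
        for (char c = 'A'; c <= 'Z'; c++)
            AppendDistinctForLetter(inputFiles, c, output);
    }
}
```

Calling PrefixDedup.Run(files, clear) with the question's variables would write the combined, de-duplicated output grouped by first letter, so the original line order is not preserved.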

Of course, you can optimise this further. Check how much memory is available and divide the possible characters into ranges, so that, for example, on the first iteration your HashSet holds every line starting with A-D.
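As an illustration of the range idea, the sketch below splits A-Z into fixed-width ranges and tests whether a line belongs to one of them. The names PrefixRanges, Split and InRange are hypothetical, and the width is simply passed in; choosing it from the memory actually available is left open here. With a width of 4, A-Z becomes 7 ranges (A-D, E-H, ..., Y-Z), so the files are read 7 times instead of 26.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixRanges
{
    // Splits [first, last] into consecutive ranges of at most `width` characters.
    public static List<(char Lo, char Hi)> Split(char first, char last, int width)
    {
        if (width < 1) throw new ArgumentOutOfRangeException(nameof(width));
        var ranges = new List<(char Lo, char Hi)>();
        for (int lo = first; lo <= last; lo += width)
        {
            // The final range may be shorter than `width`.
            int hi = Math.Min(lo + width - 1, last);
            ranges.Add(((char)lo, (char)hi));
        }
        return ranges;
    }

    // True when the line's first character falls inside the range (inclusive).
    public static bool InRange(string line, (char Lo, char Hi) range) =>
        line.Length > 0 && line[0] >= range.Lo && line[0] <= range.Hi;
}
```

Each pass would then keep one HashSet for all lines where InRange is true for the current range, and discard it before moving to the next range.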
