Efficient Methods of Comparing Text Files Simultaneously

I did check whether any existing questions matched mine and didn't see any; if I missed one, my mistake.

I have two text files to compare against each other. One is a temporary log file that is sometimes overwritten; the other is a permanent log that collects the contents of the temp log into one file (it picks up the lines added to the temp log since it last checked and appends them to the end of the complete log). However, after a point the complete log may become quite large and therefore inefficient to compare against, so I have been thinking about different ways to approach this.

My first idea is to "buffer" the temp log's lines (since it will normally be the smaller of the two) into a list and simply loop through the archive log, doing something like:

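// Buffer the (normally smaller) temp log in memory.
List<string> bufferedlines = new List<string>(File.ReadAllLines(tempPath));
using (StreamReader ArchiveStream = new StreamReader(ArchivePath))
{
    string line;
    // Walk the archive log one line at a time.
    while ((line = ArchiveStream.ReadLine()) != null)
    {
        if (bufferedlines.Contains(line))
        {
            // this temp line is already in the archive
        }
    }
}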

Now there are a couple of ways I could proceed from here. I could create yet another list to store the inconsistencies, close the read stream (I'm not sure you can read and write at the same time; if you can, that might make my options easier), then open a write stream in append mode and write the list to the file. Alternatively, cutting out the buffering of the inconsistencies, I could open a write stream while the files are being compared and write the unmatched lines on the spot.

The other method I could think of is limited by my not knowing whether it can be done: rather than buffering either file, compare the streams side by side as they are read and append the lines on the fly. Something like:

using (StreamReader templogStream = new StreamReader(tempPath))
{
    string line;
    while ((line = templogStream.ReadLine()) != null)
    {
        // Re-scans the whole archive for every temp line (needs System.Linq).
        if (!File.ReadLines(ArchivePath).Contains(line))
        {
            //write the line to the file
        }
    }
}

As I said, I'm not sure whether that would work or whether it would be more efficient than the first method, so I figured I'd ask whether anyone has insight into how this might properly be implemented, and whether it is the most efficient way or there is a better method out there.

Effectively, what you want here is all of the items from one set that aren't in another set. This is set subtraction, or in LINQ terms, Except. If your data sets are sufficiently small, you could simply do this:

var lines = File.ReadLines(TempPath)
    .Except(File.ReadLines(ArchivePath))
    .ToList(); // materialize first: can't write to the file while reading from it
File.AppendAllLines(ArchivePath, lines);

Of course, this code requires bringing all of the lines in the archive file into memory, because that's just how Except is implemented: it creates a HashSet of all of the items in its second sequence (here, the archive) so that it can efficiently find matches for the items from the other sequence.
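To make those mechanics concrete, here is a minimal hand-written sketch of roughly what the Except call does. The class and method names (LogMerge, FindNewLines) and the parameter names are hypothetical, chosen only for illustration; this is not the framework's actual source.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LogMerge
{
    // Hypothetical helper, not part of the original answer.
    // Returns the lines of tempPath that do not yet appear in archivePath.
    public static List<string> FindNewLines(string tempPath, string archivePath)
    {
        // Every archive line is loaded into this set; this is where the memory goes.
        var seen = new HashSet<string>(File.ReadLines(archivePath));
        var newLines = new List<string>();

        foreach (string line in File.ReadLines(tempPath))
        {
            // Add returns false when the line is already in the set, so archived
            // lines and repeated temp lines are both skipped (Except also
            // returns each distinct line only once).
            if (seen.Add(line))
            {
                newLines.Add(line);
            }
        }
        return newLines;
    }
}
```

It would be used the same way as the LINQ version, for example File.AppendAllLines(archivePath, LogMerge.FindNewLines(tempPath, archivePath)). Each lookup in the set takes roughly constant time, which is why this beats calling List.Contains for every line.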

Presumably the number of lines that need to be added is pretty small, so the fact that the lines we find all need to be stored in memory isn't a problem. If there could be a lot of them, you'd want to write them out to a file other than the first one (possibly concatenating the two files when done, if needed).
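A rough sketch of that variant follows, assuming a scratch file is acceptable. The names LogMergeStreaming, AppendNewLines, and pendingPath are hypothetical. Unlike Except, this version uses Contains rather than Add so that the new lines are never held in memory, which means a line repeated in the temp log is written each time it occurs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LogMergeStreaming
{
    // Hypothetical helper; pendingPath is an assumed scratch file location.
    public static void AppendNewLines(string tempPath, string archivePath, string pendingPath)
    {
        // The archive lines still have to be held in memory for the lookups.
        var archived = new HashSet<string>(File.ReadLines(archivePath));

        // Stream each unmatched line straight to the scratch file instead of a list.
        using (var writer = new StreamWriter(pendingPath, false))
        {
            foreach (string line in File.ReadLines(tempPath))
            {
                if (!archived.Contains(line))
                {
                    writer.WriteLine(line);
                }
            }
        }

        // Concatenate: copy the scratch file onto the end of the archive.
        // Assumes the archive already ends with a line break.
        using (var source = File.OpenRead(pendingPath))
        using (var target = new FileStream(archivePath, FileMode.Append, FileAccess.Write))
        {
            source.CopyTo(target);
        }
        File.Delete(pendingPath);
    }
}
```

The archive is only opened for writing after the set has been built and the reader closed, so the read and the append never overlap.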
