
Combining Two Text Files Removing Duplicates

I have 2 text files that are as follows (large numbers like 1466786391 being unique timestamps):

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

....

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391

and this:

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
PING 10.0.0.6 (10.0.0.6): 56 data byte

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....

So the first file ends with the timestamp 1466786391, and the second file contains that same data block somewhere in the middle, followed by more data; everything before that timestamp is exactly the same as in the first file.

So the output I want is this:

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

....

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 44 packets received, 12% packet loss
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
PING 10.0.0.6 (10.0.0.6): 56 data bytes
....

That is, concatenate the two files and create a third one, removing from the second file the duplicates (the text blocks that are already present in the first file). Here's my code:

public static void UnionFiles()
{ 

    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat");
    var union = Enumerable.Empty<string>();

    foreach (string filePath in Directory
                .EnumerateFiles(folderPath, "*.txt")
                .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        union = union.Union(File.ReadAllLines(filePath));
    }
    File.WriteAllLines(outputFilePath, union);
}

This is the wrong output I am getting (the file structure is destroyed):

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 49 packets received, 2% packet loss
round-trip min/avg/max = 20.917/70.216/147.258 ms
1466786342
PING 10.0.0.6 (10.0.0.6): 56 data bytes

--- 10.0.0.6 ping statistics ---
50 packets transmitted, 50 packets received, 0% packet loss
round-trip min/avg/max = 29.535/65.768/126.983 ms
1466786391
round-trip min/avg/max = 30.238/62.772/102.959 ms
1466786442
round-trip min/avg/max = 5.475/40.986/96.964 ms
1466786492
round-trip min/avg/max = 5.276/61.309/112.530 ms

EDIT: This code was written to handle multiple files; however, I am happy even if just 2 can be done correctly.

However, this doesn't remove the text blocks as it should; it removes several useful lines and makes the output utterly useless. I am stuck.

How to achieve this? Thanks.

I think you want to compare blocks, not individual lines.

Something like this should work:

public static void UnionFiles()
{
    var firstFilePath = "log1.txt";
    var secondFilePath = "log2.txt";

    var firstLogBlocks = ReadFileAsLogBlocks(firstFilePath);
    var secondLogBlocks = ReadFileAsLogBlocks(secondFilePath);

    var cleanLogBlock = firstLogBlocks.Union(secondLogBlocks);

    var cleanLog = new StringBuilder();
    foreach (var block in cleanLogBlock)
    {
        cleanLog.Append(block);
    }

    File.WriteAllText("cleanLog.txt", cleanLog.ToString());
}

private static List<LogBlock> ReadFileAsLogBlocks(string filePath)
{
    var allLinesLog = File.ReadAllLines(filePath);

    var logBlocks = new List<LogBlock>();
    var currentBlock = new List<string>();

    var i = 0;
    foreach (var line in allLinesLog)
    {
        if (!string.IsNullOrEmpty(line))
        {
            currentBlock.Add(line);
            if (i == 4) // a full block is 5 non-empty lines
            {
                logBlocks.Add(new LogBlock(currentBlock.ToArray()));
                currentBlock.Clear();
                i = 0;
            }
            else
            {
                i++;
            }
        }
    }

    // Flush a trailing partial block (e.g. a final block with no PING line)
    if (currentBlock.Count > 0)
    {
        logBlocks.Add(new LogBlock(currentBlock.ToArray()));
    }

    return logBlocks;
}

With a log block defined as follows:

public class LogBlock
{
    private readonly string[] _logs;

    public LogBlock(string[] logs)
    {
        _logs = logs;
    }

    public override string ToString()
    {
        var logBlock = new StringBuilder();
        foreach (var log in _logs)
        {
            logBlock.AppendLine(log);
        }

        return logBlock.ToString();
    }

    public override bool Equals(object obj)
    {
        return obj is LogBlock && Equals((LogBlock)obj);
    }

    private bool Equals(LogBlock other)
    {
        return _logs.SequenceEqual(other._logs);
    }

    public override int GetHashCode()
    {
        var hashCode = 0;
        foreach (var log in _logs)
        {
            hashCode += log.GetHashCode();
        }
        return hashCode;
    }
}

Please be careful to override Equals in LogBlock and to keep GetHashCode consistent with it, as Union uses both of them.
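To illustrate why both overrides matter (a minimal hypothetical example, not part of the solution above): Union hashes each element first and only calls Equals on a hash match, so a type that overrides Equals but keeps the default reference-based GetHashCode lets "duplicates" slip through:

```csharp
using System;
using System.Linq;

// Hypothetical element type: Equals is overridden, GetHashCode is not.
class EqualsOnly
{
    public int Value;

    public override bool Equals(object obj) =>
        obj is EqualsOnly other && other.Value == Value;

    // GetHashCode intentionally left at the default (reference-based),
    // which breaks the Equals/GetHashCode contract.
}

class Demo
{
    static void Main()
    {
        var a = new[] { new EqualsOnly { Value = 1 } };
        var b = new[] { new EqualsOnly { Value = 1 } };

        // The two distinct instances almost certainly get different default
        // hash codes, so Equals is never consulted and the duplicate
        // survives: this typically prints 2 rather than 1.
        Console.WriteLine(a.Union(b).Count());
    }
}
```

With a matching GetHashCode (e.g. returning Value.GetHashCode()), the same Union call would collapse the two elements into one.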

A rather hacky solution using a regular expression:

var logBlockPattern = new Regex(@"(^---.*ping statistics ---$)\s+"
                              + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
                              + @"(^round-trip min/avg/max.+$)\s+"
                              + @"(^\d+$)\s*"
                              + @"(^PING.+$)?",
                                RegexOptions.Multiline);

var logBlocks1 = logBlockPattern.Matches(FileContent1).Cast<Match>().ToList();
var logBlocks2 = logBlockPattern.Matches(FileContent2).Cast<Match>().ToList();

var mergedLogBlocks = logBlocks1.Concat(logBlocks2.Where(lb2 => 
    logBlocks1.All(lb1 => lb1.Groups[4].Value != lb2.Groups[4].Value)));

var mergedLogContents = string.Join("\n\n", mergedLogBlocks);

The Groups collection of a regex Match contains each line of a log block (because in the pattern each line is wrapped in parentheses ()), with the complete match at index 0. Hence the group at index 4 is the timestamp, which we can use to compare log blocks.
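As a quick sanity check of the group numbering, here is the pattern run over a single block taken from the question (a self-contained sketch; only the sample string is made up):

```csharp
using System;
using System.Text.RegularExpressions;

class GroupsDemo
{
    static void Main()
    {
        var logBlockPattern = new Regex(
            @"(^---.*ping statistics ---$)\s+"
          + @"(^.+packets transmitted.+packets received.+packet loss$)\s+"
          + @"(^round-trip min/avg/max.+$)\s+"
          + @"(^\d+$)\s*"
          + @"(^PING.+$)?",
            RegexOptions.Multiline);

        var sample = "--- 10.0.0.6 ping statistics ---\n"
                   + "50 packets transmitted, 49 packets received, 2% packet loss\n"
                   + "round-trip min/avg/max = 20.917/70.216/147.258 ms\n"
                   + "1466786342\n"
                   + "PING 10.0.0.6 (10.0.0.6): 56 data bytes\n";

        var m = logBlockPattern.Match(sample);

        // Group 4 is the (^\d+$) capture, i.e. the timestamp line.
        Console.WriteLine(m.Groups[4].Value); // prints 1466786342
    }
}
```

Note that the fifth group (the PING line) is optional, which is what lets the pattern also match the final block of a file that ends at the timestamp.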

Working example: https://dotnetfiddle.net/kAkGll

There is an issue in concatenating the unique records. Can you please check the code below?

public static void UnionFiles()
{
    string folderPath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http");
    string outputFilePath = Path.Combine(Path.GetDirectoryName(Assembly.GetEntryAssembly().Location), "http\\union.dat");
    var union = new List<string>();

    foreach (string filePath in Directory
            .EnumerateFiles(folderPath, "*.txt")
            .OrderBy(x => Path.GetFileNameWithoutExtension(x)))
    {
        var filter = File.ReadAllLines(filePath).Where(x => !union.Contains(x)).ToList();
        union.AddRange(filter);
    }
    File.WriteAllLines(outputFilePath, union);
}
