Fastest way to search an array for matches in another array

Background: I have two or more files that I need to search through for matches. These files can easily have more than 20,000 lines. I need to find the fastest way to search through them and find matches between files.

I've never done matching like this, where there could be more than one match and I need to return them all.

What I know:

  • A file cannot match with itself.
  • Files match based on a set of fields. If any of the fields match, the row matches.
  • This will be run fairly frequently, so it needs to be as fast as possible.

My current method involves excessive use of the IEnumerable LINQ methods.

    Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
    Dim fileText As IEnumerable(Of IEnumerable(Of CCDDetail)) = fileNames.Select(Function(fileName, fileIndex)
                                                                                     Dim list As New List(Of String)({fileName})
                                                                                     list.AddRange(File.ReadAllLines(fileName))
                                                                                     Return list.Where(Function(fileLine, lineIndex) Not {list.Count - 1, list.Count - 2, 0, 1, 2}.Contains(lineIndex)).
                                                                                         Select(Function(fileLine) New CCDDetail(list(0), fileLine.Substring(12, 17).Trim(), fileLine.Substring(29, 10).Trim(), fileLine.Substring(39, 8).Trim(), fileLine.Substring(48, 6).Trim(), fileLine.Substring(54, 22).Trim()))
                                                                                 End Function)
    Dim asdf = fileText.
        Select(Function(file, inx) file.
                        Select(Function(fileLine, ix) fileText.
                                   Skip(inx + 1).
                                   Select(Function(fileToSearch) fileLine.MatchesAny(ix, fileToSearch)).
                                   Aggregate(New List(Of Integer)(), Function(old, cc)
                                                                         Dim lcc As New List(Of Integer)(cc)
                                                                         lcc.Insert(0, If(old.Count > 0, old(0) + 1, 1))
                                                                         old.AddRange(lcc)
                                                                         Return old
                                                                     End Function)))

Functions in CCDDetail:

    Public Function Matches(ccd2 As CCDDetail) As Boolean
        Return CustomerName = ccd2.CustomerName OrElse
                DfiAccountNumber = ccd2.DfiAccountNumber OrElse
                CustomerRefId = ccd2.CustomerRefId OrElse
                PaymentAmount = ccd2.PaymentAmount OrElse
                PaymentId = ccd2.PaymentId
    End Function

    Public Function MatchesAny(index As Integer, ccd2 As IEnumerable(Of CCDDetail)) As IEnumerable(Of Integer)
        Return Enumerable.Range(0, ccd2.Count).Where(Function(i) ccd2(i).Matches(Me))
    End Function

This works on my test files, but with full-length files it takes around seven minutes.

Questions:

  • Does LINQ slow things down too much? Should I write my own loops?
  • Should I use regex rather than Substring?

Is there a faster way of doing this? Any performance tips?

Thanks.

UPDATE:

I just got the run time down considerably by using a list of dictionaries and a regex. I'll finish the application and then do some tests comparing variations.

    Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
    Dim textFiles As New List(Of Dictionary(Of Integer, CCDDetail))()
    Dim fileInnerText As String()
    Dim reg As Regex = New Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled)
    Dim mat As Match
    Dim fileSpecText As Dictionary(Of Integer, CCDDetail)
    Dim lineMatches As New List(Of Integer())
    For i As Integer = 0 To fileNames.Length - 1
        fileInnerText = File.ReadAllLines(fileNames(i))
        fileSpecText = New Dictionary(Of Integer, CCDDetail)()
        For j As Integer = 2 To fileInnerText.Length - 3
            mat = reg.Match(fileInnerText(j))
            fileSpecText.Add(j, New CCDDetail(mat.Groups(1).Value, mat.Groups(2).Value, mat.Groups(3).Value, mat.Groups(4).Value, mat.Groups(5).Value))
        Next
        textFiles.Add(fileSpecText)
    Next
    For i As Integer = 0 To textFiles.Count - 1
        'Dim source As Dictionary(Of Integer, CCDDetail) = textFiles(i)
        For j As Integer = 2 To textFiles(i).Count - 1 + 2
            For k As Integer = i + 1 To textFiles.Count - 1
                For l As Integer = 2 To textFiles(k).Count - 1 + 2
                    If (textFiles(i)(j).Matches(textFiles(k)(l))) Then
                        lineMatches.Add({i, j, k, l})
                    End If
                Next
            Next
        Next
    Next

Please view my comments to your question. The following (untested) example code shows how you could use the Dictionary<> to possibly speed things up. It takes your "Update" and builds from there so that you can follow my C# example (sorry, I don't write VB.net). The idea is that it is faster to use your field as a key to find all matching lines (lines that have the same field value).

Your code (and mine) could be further improved to not load all files into memory at once, and when comparing two files, you only need one loaded into the dictionary at a time.
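
The one-file-at-a-time idea could look roughly like the sketch below. It is only an illustration under some assumptions, not a tested replacement for the code that follows: `FieldIndex`, `SplitFields`, `Build` and `FindMatches` are made-up names, the field offsets are copied from the regex in your update, every line is assumed to be at least 76 characters long, and the header/trailer lines are assumed to have been skipped before the lines are passed in (for example from `File.ReadLines`, which reads lazily).

```csharp
using System;
using System.Collections.Generic;

public static class FieldIndex
{
    // Offsets and lengths of the five fields, taken from the regex
    // ".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})" in the question's update.
    private static readonly int[] Starts  = { 12, 29, 39, 48, 54 };
    private static readonly int[] Lengths = { 17, 10, 8, 6, 22 };

    // Hypothetical helper: cut one fixed-width line into its five trimmed fields.
    public static string[] SplitFields(string line)
    {
        var fields = new string[Starts.Length];
        for (int k = 0; k < Starts.Length; ++k)
            fields[k] = line.Substring(Starts[k], Lengths[k]).Trim();
        return fields;
    }

    // Index a single file: (field number, value) -> positions of the lines
    // holding that value. Positions are zero-based within the sequence passed in.
    public static Dictionary<Tuple<int, string>, List<int>> Build(IEnumerable<string> lines)
    {
        var index = new Dictionary<Tuple<int, string>, List<int>>();
        int lineNo = 0;
        foreach (var line in lines)
        {
            var fields = SplitFields(line);
            for (int k = 0; k < fields.Length; ++k)
            {
                var key = Tuple.Create(k, fields[k]);
                List<int> hits;
                if (!index.TryGetValue(key, out hits))
                    index[key] = hits = new List<int>();
                hits.Add(lineNo);
            }
            ++lineNo;
        }
        return index;
    }

    // Stream the second file past the index; it is never put into a dictionary.
    // Yields (line in indexed file, line in streamed file, field number).
    public static IEnumerable<Tuple<int, int, int>> FindMatches(
        Dictionary<Tuple<int, string>, List<int>> index, IEnumerable<string> otherLines)
    {
        int lineNo = 0;
        foreach (var line in otherLines)
        {
            var fields = SplitFields(line);
            for (int k = 0; k < fields.Length; ++k)
            {
                List<int> hits;
                if (index.TryGetValue(Tuple.Create(k, fields[k]), out hits))
                    foreach (var hit in hits)
                        yield return Tuple.Create(hit, lineNo, k);
            }
            ++lineNo;
        }
    }
}
```

With this shape only the indexed file sits in memory as a dictionary, and each lookup is a hash probe instead of a scan over every row of the other file. Note that blank fields would match each other here, so you may want to skip empty values.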

    public void CompareLines(string[] fileNames)
    {
        var textFileDictionaries = new List<Dictionary<CCDDetail,List<int>>>();
        var reg  = new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);
        var lineMatches = new List<LineMatch>();

        foreach(var f in fileNames)
        {
            var fileInnerText = File.ReadAllLines(f);
            var fileSpecText = new Dictionary<CCDDetail,List<int>>();
            for(int j = 2; j < fileInnerText.Length - 2; ++j) // skip the first two and last two lines, as in the question's update
            {
                var mat = reg.Match(fileInnerText[j]);
                for(int k=1; k<=5; ++k)
                {
                    var key = new CCDDetail() { FieldId = k, Value = mat.Groups[k].Value };
                    //field and value may occur on multiple lines?
                    if (fileSpecText.ContainsKey(key) == false)
                        fileSpecText.Add(key, new List<int>());
                    fileSpecText[key].Add(j);
                }
            }
            textFileDictionaries.Add(fileSpecText);
        }
        // compare every pair of files (i, j) with i < j, so a file is never compared with itself
        for(int i=0; i<textFileDictionaries.Count - 1; ++i)
        {
            for (int j = i+1; j < textFileDictionaries.Count; ++j)
            {
                foreach(var tup in textFileDictionaries[j])
                {
                    if(textFileDictionaries[i].ContainsKey(tup.Key))
                    {
                        // the field value might occur on multiple lines
                        lineMatches.Add(new LineMatch() { 
                            File1Index=i,
                            File1Lines = textFileDictionaries[i][tup.Key],
                            File2Index=j,
                            File2Lines = textFileDictionaries[j][tup.Key]
                        });
                    }
                }
                /*
                for (int k = 0; k < textFileDictionaries[j].Count; ++k)
                {
                    var key = textFileDictionaries[j].Keys.ToArray()[k];
                    if (textFileDictionaries[i].ContainsKey(key))
                    {
                        // the field value might occur on multiple lines
                        lineMatches.Add(new LineMatch()
                        {
                            File1Index = i,
                            File1Lines = textFileDictionaries[i][key],
                            File2Index = j,
                            File2Lines = textFileDictionaries[j][key]
                        });
                    }
                }
               */
            }
        }
    }

....

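    // Dictionary key: one field of one line (which field, and its value).
    public class CCDDetail
    {
        public int FieldId { get; set; }
        public string Value { get; set; }

        public override bool Equals(object obj)
        {
            // Guard against null or foreign types instead of throwing.
            var other = obj as CCDDetail;
            return other != null && FieldId == other.FieldId && string.Equals(Value, other.Value);
        }
        public override int GetHashCode()
        {
            unchecked
            {
                return (FieldId * 397) ^ (Value != null ? Value.GetHashCode() : 0);
            }
        }
    }
    // One field value shared by two files, with the lines it occurs on in each.
    public class LineMatch
    {
        public int File1Index { get; set; }
        public List<int> File1Lines { get; set; }
        public int File2Index { get; set; }
        public List<int> File2Lines { get; set; }
    }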

Keep in mind, my assumption is that the same field value can occur on multiple lines in either of the files being compared. Also, the LineMatch list needs post-processing, because each record holds all the lines of both files that share one field value (you might also want to record which field number matched).
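
As a rough idea of what that post-processing might look like for one pair of files, here is a small sketch. `FieldMatch` and `ToLinePairs` are hypothetical names; `FieldMatch` is just `LineMatch` with the field number added and the file indexes left out. It expands each record into individual line pairs and merges duplicates, so two rows that agree on several fields show up once, together with the fields they matched on.

```csharp
using System;
using System.Collections.Generic;

public static class MatchPostProcessing
{
    // Hypothetical variant of LineMatch that also remembers the matching field.
    public class FieldMatch
    {
        public int FieldId { get; set; }
        public List<int> File1Lines { get; set; }
        public List<int> File2Lines { get; set; }
    }

    // Result: (line in file 1, line in file 2) -> set of field numbers they share.
    public static Dictionary<Tuple<int, int>, SortedSet<int>> ToLinePairs(
        IEnumerable<FieldMatch> matches)
    {
        var pairs = new Dictionary<Tuple<int, int>, SortedSet<int>>();
        foreach (var m in matches)
            foreach (var l1 in m.File1Lines)
                foreach (var l2 in m.File2Lines)
                {
                    var key = Tuple.Create(l1, l2);
                    SortedSet<int> fields;
                    if (!pairs.TryGetValue(key, out fields))
                        pairs[key] = fields = new SortedSet<int>();
                    fields.Add(m.FieldId);
                }
        return pairs;
    }
}
```

Be aware that a value shared by many lines in both files expands into every combination of those lines, so the result can get large for low-variety fields such as the payment amount.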
