Fastest way to search an array for matches in another array

Background: I have two or more files that I need to search through for matches. These files can easily have more than 20,000 lines. I need to find the fastest way to search through them and find matches between files.

I've never done matching like this, where there could be more than one match and I need to return them all.

What I know:

  • A file cannot match with itself.
  • Files match based on a set of fields. If any of the fields match, the row matches.
  • This will be run fairly frequently, so it needs to be as fast as possible.

My current method relies heavily on the IEnumerable LINQ methods.

    Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
    Dim fileText As IEnumerable(Of IEnumerable(Of CCDDetail)) = fileNames.Select(Function(fileName, fileIndex)
                                                                                     Dim list As New List(Of String)({fileName})
                                                                                     list.AddRange(File.ReadAllLines(fileName))
                                                                                     Return list.Where(Function(fileLine, lineIndex) Not {list.Count - 1, list.Count - 2, 0, 1, 2}.Contains(lineIndex)).
                                                                                         Select(Function(fileLine) New CCDDetail(list(0), fileLine.Substring(12, 17).Trim(), fileLine.Substring(29, 10).Trim(), fileLine.Substring(39, 8).Trim(), fileLine.Substring(48, 6).Trim(), fileLine.Substring(54, 22).Trim()))
                                                                                 End Function)
    Dim asdf = fileText.
        Select(Function(file, inx) file.
                        Select(Function(fileLine, ix) fileText.
                                   Skip(inx + 1).
                                   Select(Function(fileToSearch) fileLine.MatchesAny(ix, fileToSearch)).
                                   Aggregate(New List(Of Integer)(), Function(old, cc)
                                                                         Dim lcc As New List(Of Integer)(cc)
                                                                         lcc.Insert(0, If(old.Count > 0, old(0) + 1, 1))
                                                                         old.AddRange(lcc)
                                                                         Return old
                                                                     End Function)))

Functions in CCDDetail:

    Public Function Matches(ccd2 As CCDDetail) As Boolean
        Return CustomerName = ccd2.CustomerName OrElse
                DfiAccountNumber = ccd2.DfiAccountNumber OrElse
                CustomerRefId = ccd2.CustomerRefId OrElse
                PaymentAmount = ccd2.PaymentAmount OrElse
                PaymentId = ccd2.PaymentId
    End Function

    Public Function MatchesAny(index As Integer, ccd2 As IEnumerable(Of CCDDetail)) As IEnumerable(Of Integer)
        Return Enumerable.Range(0, ccd2.Count).Where(Function(i) ccd2(i).Matches(Me))
    End Function

This works on my test files; however, with full-length files it takes around seven minutes.

Questions:

  • Does LINQ slow things down too much? Should I write my own loops?
  • Should I use regex rather than Substring?

Is there a faster way of doing this? Any performance tips?

Thanks.

UPDATE:

I just got the run time down considerably by using a list of dictionaries and a regex. I'll finish the application and then do some tests comparing variations.

    Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
    Dim textFiles As New List(Of Dictionary(Of Integer, CCDDetail))()
    Dim fileInnerText As String()
    Dim reg As Regex = New Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled)
    Dim mat As Match
    Dim fileSpecText As Dictionary(Of Integer, CCDDetail)
    Dim lineMatches As New List(Of Integer())
    For i As Integer = 0 To fileNames.Length - 1
        fileInnerText = File.ReadAllLines(fileNames(i))
        fileSpecText = New Dictionary(Of Integer, CCDDetail)()
        For j As Integer = 2 To fileInnerText.Length - 3
            mat = reg.Match(fileInnerText(j))
            fileSpecText.Add(j, New CCDDetail(mat.Groups(1).Value, mat.Groups(2).Value, mat.Groups(3).Value, mat.Groups(4).Value, mat.Groups(5).Value))
        Next
        textFiles.Add(fileSpecText)
    Next
    For i As Integer = 0 To textFiles.Count - 1
        'Dim source As Dictionary(Of Integer, CCDDetail) = textFiles(i)
        For j As Integer = 2 To textFiles(i).Count - 1 + 2
            For k As Integer = i + 1 To textFiles.Count - 1
                For l As Integer = 2 To textFiles(k).Count - 1 + 2
                    If (textFiles(i)(j).Matches(textFiles(k)(l))) Then
                        lineMatches.Add({i, j, k, l})
                    End If
                Next
            Next
        Next
    Next

Please see my comments on your question. The following (untested) example code shows how you could use a Dictionary<> to possibly speed things up. It takes your update and builds from there, in C# (sorry, I don't write VB.NET). The idea is that it is faster to use each field value as a dictionary key to find all matching lines (lines that have the same field value) than to compare every line against every other line.

Your code (and mine) could be further improved by not loading all files into memory at once: when comparing two files, only one of them needs to be loaded into a dictionary at a time.

    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    // Returns the matches so the caller can use them
    // (the original version built the list but never returned it).
    public List<LineMatch> CompareLines(string[] fileNames)
    {
        var textFileDictionaries = new List<Dictionary<CCDDetail, List<int>>>();
        var reg = new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);
        var lineMatches = new List<LineMatch>();

        foreach (var f in fileNames)
        {
            var fileInnerText = File.ReadAllLines(f);
            var fileSpecText = new Dictionary<CCDDetail, List<int>>();
            // skip the first two and last two lines of the file, as the question's update does
            for (int j = 2; j < fileInnerText.Length - 2; ++j)
            {
                var mat = reg.Match(fileInnerText[j]);
                if (!mat.Success)
                    continue; // line is too short for the fixed-width layout
                for (int k = 1; k <= 5; ++k)
                {
                    var key = new CCDDetail() { FieldId = k, Value = mat.Groups[k].Value };
                    // the same field and value may occur on multiple lines
                    if (fileSpecText.ContainsKey(key) == false)
                        fileSpecText.Add(key, new List<int>());
                    fileSpecText[key].Add(j);
                }
            }
            textFileDictionaries.Add(fileSpecText);
        }
        // compare every pair of distinct files (i < j); the last file must be included as j
        for (int i = 0; i < textFileDictionaries.Count - 1; ++i)
        {
            for (int j = i + 1; j < textFileDictionaries.Count; ++j)
            {
                foreach (var tup in textFileDictionaries[j])
                {
                    if (textFileDictionaries[i].ContainsKey(tup.Key))
                    {
                        // the field value might occur on multiple lines
                        lineMatches.Add(new LineMatch() {
                            File1Index = i,
                            File1Lines = textFileDictionaries[i][tup.Key],
                            File2Index = j,
                            File2Lines = textFileDictionaries[j][tup.Key]
                        });
                    }
                }
                /*
                for (int k = 0; k < textFileDictionaries[j].Count; ++k)
                {
                    var key = textFileDictionaries[j].Keys.ToArray()[k];
                    if (textFileDictionaries[i].ContainsKey(key))
                    {
                        // the field value might occur on multiple lines
                        lineMatches.Add(new LineMatch()
                        {
                            File1Index = i,
                            File1Lines = textFileDictionaries[i][key],
                            File2Index = j,
                            File2Lines = textFileDictionaries[j][key]
                        });
                    }
                }
               */
            }
        }
        return lineMatches;
    }

....

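    // Dictionary key: one field (by number) together with its value.
    public class CCDDetail
    {
        public int FieldId { get; set; }
        public string Value { get; set; }

        public override bool Equals(object obj)
        {
            // guard against null and foreign types instead of throwing
            var other = obj as CCDDetail;
            return other != null && FieldId == other.FieldId && Value == other.Value;
        }
        public override int GetHashCode()
        {
            // combine both parts; unchecked so overflow wraps instead of throwing
            unchecked
            {
                return (FieldId * 397) ^ (Value == null ? 0 : Value.GetHashCode());
            }
        }
    }
    // One shared field value between two files, with every line that holds it.
    public class LineMatch
    {
        public int File1Index { get; set; }
        public List<int> File1Lines { get; set; }
        public int File2Index { get; set; }
        public List<int> File2Lines { get; set; }
    }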
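
The earlier remark that only one file needs to sit in a dictionary at a time could be sketched roughly as follows. This is an untested illustration, not part of the code above: StreamingMatcher, BuildIndex, FindMatches and CompareFiles are hypothetical names, a (FieldId, Value) tuple stands in for the key class, line numbers are zero-based, and the skipping of header and trailer lines is left out for brevity. Only the first file is indexed; the second is read lazily with File.ReadLines.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class StreamingMatcher
{
    // Same fixed-width layout as the regex used in the question's update.
    static readonly Regex Layout =
        new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);

    // Index one file: (field number, value) -> numbers of the lines holding that value.
    public static Dictionary<(int FieldId, string Value), List<int>> BuildIndex(IEnumerable<string> lines)
    {
        var index = new Dictionary<(int FieldId, string Value), List<int>>();
        int lineNo = 0;
        foreach (var line in lines)
        {
            var m = Layout.Match(line);
            if (m.Success)
            {
                for (int k = 1; k <= 5; ++k)
                {
                    var key = (k, m.Groups[k].Value);
                    if (!index.TryGetValue(key, out var list))
                        index[key] = list = new List<int>();
                    list.Add(lineNo);
                }
            }
            ++lineNo;
        }
        return index;
    }

    // Stream the other file against the index, one line at a time.
    // Yields (line in indexed file, line in streamed file, field number that matched).
    public static IEnumerable<(int Line1, int Line2, int FieldId)> FindMatches(
        Dictionary<(int FieldId, string Value), List<int>> index, IEnumerable<string> otherLines)
    {
        int lineNo = 0;
        foreach (var line in otherLines)
        {
            var m = Layout.Match(line);
            if (m.Success)
            {
                for (int k = 1; k <= 5; ++k)
                {
                    if (index.TryGetValue((k, m.Groups[k].Value), out var hits))
                    {
                        foreach (var hit in hits)
                            yield return (hit, lineNo, k);
                    }
                }
            }
            ++lineNo;
        }
    }

    // Convenience wrapper: only pathA is held in memory as an index.
    public static IEnumerable<(int Line1, int Line2, int FieldId)> CompareFiles(string pathA, string pathB)
    {
        return FindMatches(BuildIndex(File.ReadLines(pathA)), File.ReadLines(pathB));
    }
}
```

With more than two files, the same pair of calls would be repeated for each pair of files, so each file is indexed while it is the first of a pair and streamed otherwise.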

Keep in mind that I am assuming the same field value can appear on multiple lines in either of the files being compared. Also, the LineMatch list needs post-processing, because each record holds all the lines of both files that share one field value (you might also want to record which field number matched).
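
One possible shape for that post-processing is sketched below, untested. It assumes the match record is extended with the field number: FieldLineMatch, MatchPostProcessing and GroupByLinePair are hypothetical names, not part of the code above. Each record is expanded into individual line pairs, and the field numbers on which each pair matched are collected.

```csharp
using System.Collections.Generic;

// Hypothetical variant of LineMatch that also records which field matched.
public class FieldLineMatch
{
    public int FieldId { get; set; }
    public int File1Index { get; set; }
    public List<int> File1Lines { get; set; }
    public int File2Index { get; set; }
    public List<int> File2Lines { get; set; }
}

public static class MatchPostProcessing
{
    // Expand every record into (file1, line1, file2, line2) pairs and collect,
    // per pair, the set of field numbers on which the two lines matched.
    public static Dictionary<(int File1, int Line1, int File2, int Line2), SortedSet<int>> GroupByLinePair(
        IEnumerable<FieldLineMatch> matches)
    {
        var pairs = new Dictionary<(int File1, int Line1, int File2, int Line2), SortedSet<int>>();
        foreach (var m in matches)
        {
            foreach (var l1 in m.File1Lines)
            {
                foreach (var l2 in m.File2Lines)
                {
                    var key = (m.File1Index, l1, m.File2Index, l2);
                    if (!pairs.TryGetValue(key, out var fields))
                        pairs[key] = fields = new SortedSet<int>();
                    fields.Add(m.FieldId);
                }
            }
        }
        return pairs;
    }
}
```

Note that a value shared by many lines in both files (a common payment amount, say) expands into every combination of those lines, so the result can be much larger than the LineMatch list itself.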
