在數組中搜索另一個數組中匹配項的最快方法

Question

背景我需要搜索兩個或更多個文件以查找匹配項。 這些文件可以輕松擁有超過20,000行。 我需要找到最快的方法來搜索它們並找到文件之間的匹配項。

我從來沒有做過這樣的比賽，在那里我可能不止一場比賽，我需要全部歸還。

我知道的：

文件本身不匹配。
文件根據一組字段進行匹配。 如果任何字段匹配，則該行匹配。
這將相當頻繁地運行，因此它需要與possilbe一樣快。

我當前的方法涉及過多使用IEnumerable LINQ方法。

    Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
    Dim fileText As IEnumerable(Of IEnumerable(Of CCDDetail)) = fileNames.Select(Function(fileName, fileIndex)
                                                                                     Dim list As New List(Of String)({fileName})
                                                                                     list.AddRange(File.ReadAllLines(fileName))
                                                                                     Return list.Where(Function(fileLine, lineIndex) Not {list.Count - 1, list.Count - 2, 0, 1, 2}.Contains(lineIndex)).
                                                                                         Select(Function(fileLine) New CCDDetail(list(0), fileLine.Substring(12, 17).Trim(), fileLine.Substring(29, 10).Trim(), fileLine.Substring(39, 8).Trim(), fileLine.Substring(48, 6).Trim(), fileLine.Substring(54, 22).Trim()))
                                                                                 End Function)
    Dim asdf = fileText.
        Select(Function(file, inx) file.
                        Select(Function(fileLine, ix) fileText.
                                   Skip(inx + 1).
                                   Select(Function(fileToSearch) fileLine.MatchesAny(ix, fileToSearch)).
                                   Aggregate(New List(Of Integer)(), Function(old, cc)
                                                                         Dim lcc As New List(Of Integer)(cc)
                                                                         lcc.Insert(0, If(old.Count > 0, old(0) + 1, 1))
                                                                         old.AddRange(lcc)
                                                                         Return old
                                                                     End Function)))

CCD詳細功能：

Public Function Matches(ccd2 As CCDDetail) As Boolean
    Return CustomerName = ccd2.CustomerName OrElse
            DfiAccountNumber = ccd2.DfiAccountNumber OrElse
            CustomerRefId = ccd2.CustomerRefId OrElse
            PaymentAmount = ccd2.PaymentAmount OrElse
            PaymentId = ccd2.PaymentId
End Function

Public Function MatchesAny(index As Integer, ccd2 As IEnumerable(Of CCDDetail)) As IEnumerable(Of Integer)
    Return Enumerable.Range(0, ccd2.Count).Where(Function(i) ccd2(i).Matches(Me))
End Function

這適用於我的測試文件，但是使用完整長度的文件大約需要7分鍾。

問題：

LINQ是否會使速度降低太多？ 我應該編寫自己的循環嗎？
我應該使用正則表達式而不是Substring嗎？

有更快的方法嗎？ 有性能提示嗎？

謝謝。

更新：

我只是使用字典和正則表達式列表將其減少了很多。 我將完成應用程序，然后進行一些比較差異的測試。

    Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
    Dim textFiles As New List(Of Dictionary(Of Integer, CCDDetail))()
    Dim fileInnerText As String()
    Dim reg As Regex = New Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled)
    Dim mat As Match
    Dim fileSpecText As Dictionary(Of Integer, CCDDetail)
    Dim lineMatches As New List(Of Integer())
    For i As Integer = 0 To fileNames.Length - 1
        fileInnerText = File.ReadAllLines(fileNames(i))
        fileSpecText = New Dictionary(Of Integer, CCDDetail)()
        For j As Integer = 2 To fileInnerText.Length - 3
            mat = reg.Match(fileInnerText(j))
            fileSpecText.Add(j, New CCDDetail(mat.Groups(1).Value, mat.Groups(2).Value, mat.Groups(3).Value, mat.Groups(4).Value, mat.Groups(5).Value))
        Next
        textFiles.Add(fileSpecText)
    Next
    For i As Integer = 0 To textFiles.Count - 1
        'Dim source As Dictionary(Of Integer, CCDDetail) = textFiles(i)
        For j As Integer = 2 To textFiles(i).Count - 1 + 2
            For k As Integer = i + 1 To textFiles.Count - 1
                For l As Integer = 2 To textFiles(k).Count - 1 + 2
                    If (textFiles(i)(j).Matches(textFiles(k)(l))) Then
                        lineMatches.Add({i, j, k, l})
                    End If
                Next
            Next
        Next
    Next

Answer 1

請查看我對您問題的評論。 以下（未經測試的）示例代碼顯示了如何使用Dictionary <>可能加快處理速度。 它需要您的“ Update”並從那里進行構建，以便您可以按照我的C＃示例進行（抱歉，我沒有寫VB.net）。 這樣做的想法是，將字段用作查找所有匹配行（具有相同字段值的行）的鍵更快。

您的代碼（和我的代碼）可以得到進一步改進，以不將所有文件立即加載到內存中，並且在比較兩個文件時，一次只需要將一個文件加載到字典中。

    public void CompareLines(string[] fileNames)
    {
        var textFileDictionaries = new List<Dictionary<CCDDetail,List<int>>>();
        var reg  = new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);
        var lineMatches = new List<LineMatch>();

        foreach(var f in fileNames)
        {
            var fileInnerText = File.ReadAllLines(f);
            var fileSpecText = new Dictionary<CCDDetail,List<int>>();
            for(int j = 1; j < fileInnerText.Length - 4; ++j) // ignore 1st and last 4 lines of file
            {
                var mat = reg.Match(fileInnerText[j]);
                for(int k=1; k<=5; ++k)
                {
                    var key = new CCDDetail() { FieldId = k, Value = mat.Groups[k].Value };
                    //field and value may occur on multiple lines?
                    if (fileSpecText.ContainsKey(key) == false)
                        fileSpecText.Add(key, new List<int>());
                    fileSpecText[key].Add(j);
                }
            }
            textFileDictionaries.Add(fileSpecText);
        }
        for(int i=0; i<textFileDictionaries.Count - 2; ++i)
        {
            for (int j = i+1; j < textFileDictionaries.Count - 1; ++j)
            {
                foreach(var tup in textFileDictionaries[j])
                {
                    if(textFileDictionaries[i].ContainsKey(tup.Key))
                    {
                        // the field value might occure on multiple lines
                        lineMatches.Add(new LineMatch() { 
                            File1Index=i,
                            File1Lines = textFileDictionaries[i][tup.Key],
                            File2Index=j,
                            File2Lines = textFileDictionaries[j][tup.Key]
                        });
                    }
                }
                /*
                for (int k = 0; k < textFileDictionaries[j].Count; ++k)
                {
                    var key = textFileDictionaries[j].Keys.ToArray()[k];
                    if (textFileDictionaries[i].ContainsKey(key))
                    {
                        // the field value might occure on multiple lines
                        lineMatches.Add(new LineMatch()
                        {
                            File1Index = i,
                            File1Lines = textFileDictionaries[i][key],
                            File2Index = j,
                            File2Lines = textFileDictionaries[j][key]
                        });
                    }
                }
               */
            }
        }
    }

....

public class CCDDetail
{
    public int FieldId { get; set; }
    public string Value { get; set; }

    public override bool Equals(object obj)
    {
        return FieldId == (obj as CCDDetail).FieldId && Value.Equals((obj as CCDDetail).Value);
    }
    public override int GetHashCode()
    {
        return FieldId.GetHashCode() + Value.GetHashCode();
    }
}
public class LineMatch
{
    public int File1Index { get; set; }
    public List<int> File1Lines { get; set; }
    public int File2Index { get; set; }
    public List<int> File2Lines { get; set; }
}

請記住，我的假設是您可以在要比較的任何一個文件的多行中使用相同的字段值。 另外，LineMatch列表需要后處理，因為它包含兩個文件中具有相同字段的所有行的記錄（您可能希望記錄哪個字段編號。

在數組中搜索另一個數組中匹配項的最快方法

問題描述

1 個解決方案

解決方案1
1 已采納 2015-03-11 20:17:56

在數組中搜索另一個數組中匹配項的最快方法

問題描述

1 個解決方案

解決方案1 1 已采納 2015-03-11 20:17:56

解決方案1
1 已采納 2015-03-11 20:17:56