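Fastest way to search an array for matches in another array
Background:
I need to search two or more files for matches. These files can easily have more than 20,000 lines each. I need to find the fastest way to search them and find the matches between the files.
I have never done this kind of matching before, where a line can have more than one match and I need all of them returned.
What I know:
My current approach relies heavily on the IEnumerable LINQ methods.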
Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
Dim fileText As IEnumerable(Of IEnumerable(Of CCDDetail)) = fileNames.Select(Function(fileName, fileIndex)
        Dim list As New List(Of String)({fileName})
        list.AddRange(File.ReadAllLines(fileName))
        Return list.Where(Function(fileLine, lineIndex) Not {list.Count - 1, list.Count - 2, 0, 1, 2}.Contains(lineIndex)).
            Select(Function(fileLine) New CCDDetail(list(0), fileLine.Substring(12, 17).Trim(), fileLine.Substring(29, 10).Trim(), fileLine.Substring(39, 8).Trim(), fileLine.Substring(48, 6).Trim(), fileLine.Substring(54, 22).Trim()))
    End Function)
Dim asdf = fileText.
    Select(Function(file, inx) file.
        Select(Function(fileLine, ix) fileText.
            Skip(inx + 1).
            Select(Function(fileToSearch) fileLine.MatchesAny(ix, fileToSearch)).
            Aggregate(New List(Of Integer)(), Function(old, cc)
                    Dim lcc As New List(Of Integer)(cc)
                    lcc.Insert(0, If(old.Count > 0, old(0) + 1, 1))
                    old.AddRange(lcc)
                    Return old
                End Function)))
CCDDetail functions:
Public Function Matches(ccd2 As CCDDetail) As Boolean
    Return CustomerName = ccd2.CustomerName OrElse
        DfiAccountNumber = ccd2.DfiAccountNumber OrElse
        CustomerRefId = ccd2.CustomerRefId OrElse
        PaymentAmount = ccd2.PaymentAmount OrElse
        PaymentId = ccd2.PaymentId
End Function

Public Function MatchesAny(index As Integer, ccd2 As IEnumerable(Of CCDDetail)) As IEnumerable(Of Integer)
    Return Enumerable.Range(0, ccd2.Count).Where(Function(i) ccd2(i).Matches(Me))
End Function
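This works on my test files, but it takes about 7 minutes with full-length files.
Question:
Is there a faster way? Any performance tips?
Thanks.
Update:
I just cut the running time down a lot by using a list of dictionaries and a regular expression. I will finish the application and then run some tests to compare the difference.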
Dim fileNames As String() = lstFiles.Items.OfType(Of String)().ToArray()
Dim textFiles As New List(Of Dictionary(Of Integer, CCDDetail))()
Dim fileInnerText As String()
Dim reg As Regex = New Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled)
Dim mat As Match
Dim fileSpecText As Dictionary(Of Integer, CCDDetail)
Dim lineMatches As New List(Of Integer())

For i As Integer = 0 To fileNames.Length - 1
    fileInnerText = File.ReadAllLines(fileNames(i))
    fileSpecText = New Dictionary(Of Integer, CCDDetail)()
    For j As Integer = 2 To fileInnerText.Length - 3
        mat = reg.Match(fileInnerText(j))
        fileSpecText.Add(j, New CCDDetail(mat.Groups(1).Value, mat.Groups(2).Value, mat.Groups(3).Value, mat.Groups(4).Value, mat.Groups(5).Value))
    Next
    textFiles.Add(fileSpecText)
Next

For i As Integer = 0 To textFiles.Count - 1
    'Dim source As Dictionary(Of Integer, CCDDetail) = textFiles(i)
    For j As Integer = 2 To textFiles(i).Count - 1 + 2
        For k As Integer = i + 1 To textFiles.Count - 1
            For l As Integer = 2 To textFiles(k).Count - 1 + 2
                If (textFiles(i)(j).Matches(textFiles(k)(l))) Then
                    lineMatches.Add({i, j, k, l})
                End If
            Next
        Next
    Next
Next
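Please see my comment on your question. The following (untested) sample code shows how using a Dictionary<> could speed up the processing. It takes your "Update" and builds on it, so you should be able to follow my C# example (sorry, I don't write VB.net). The idea is that a field value used as a dictionary key finds all matching lines (lines with the same field value) faster than comparing every line against every other line.
Your code (and mine) could be improved further by not loading all of the files into memory at once: when comparing two files, only one of them needs to be loaded into a dictionary at a time.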
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public void CompareLines(string[] fileNames)
{
    var textFileDictionaries = new List<Dictionary<CCDDetail, List<int>>>();
    var reg = new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);
    var lineMatches = new List<LineMatch>();

    foreach (var f in fileNames)
    {
        var fileInnerText = File.ReadAllLines(f);
        var fileSpecText = new Dictionary<CCDDetail, List<int>>();
        // skip the first 2 and last 2 lines, as the loop in the question's update does
        for (int j = 2; j < fileInnerText.Length - 2; ++j)
        {
            var mat = reg.Match(fileInnerText[j]);
            for (int k = 1; k <= 5; ++k)
            {
                var key = new CCDDetail() { FieldId = k, Value = mat.Groups[k].Value };
                // the same field and value may occur on multiple lines
                List<int> lines;
                if (!fileSpecText.TryGetValue(key, out lines))
                {
                    lines = new List<int>();
                    fileSpecText.Add(key, lines);
                }
                lines.Add(j);
            }
        }
        textFileDictionaries.Add(fileSpecText);
    }

    // compare every pair of files (i, j) with i < j, including the last file
    for (int i = 0; i < textFileDictionaries.Count - 1; ++i)
    {
        for (int j = i + 1; j < textFileDictionaries.Count; ++j)
        {
            foreach (var tup in textFileDictionaries[j])
            {
                List<int> file1Lines;
                if (textFileDictionaries[i].TryGetValue(tup.Key, out file1Lines))
                {
                    // the field value might occur on multiple lines
                    lineMatches.Add(new LineMatch() {
                        File1Index = i,
                        File1Lines = file1Lines,
                        File2Index = j,
                        File2Lines = tup.Value
                    });
                }
            }
            /*
            for (int k = 0; k < textFileDictionaries[j].Count; ++k)
            {
                var key = textFileDictionaries[j].Keys.ToArray()[k];
                if (textFileDictionaries[i].ContainsKey(key))
                {
                    // the field value might occur on multiple lines
                    lineMatches.Add(new LineMatch()
                    {
                        File1Index = i,
                        File1Lines = textFileDictionaries[i][key],
                        File2Index = j,
                        File2Lines = textFileDictionaries[j][key]
                    });
                }
            }
            */
        }
    }
}
....
public class CCDDetail
{
    public int FieldId { get; set; }
    public string Value { get; set; }

    public override bool Equals(object obj)
    {
        // guard against null and other types instead of dereferencing a failed cast
        var other = obj as CCDDetail;
        return other != null && FieldId == other.FieldId && string.Equals(Value, other.Value);
    }

    public override int GetHashCode()
    {
        // null-safe; uses the same two members as Equals
        return FieldId.GetHashCode() + (Value == null ? 0 : Value.GetHashCode());
    }
}

public class LineMatch
{
    public int File1Index { get; set; }
    public List<int> File1Lines { get; set; }
    public int File2Index { get; set; }
    public List<int> File2Lines { get; set; }
}
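The earlier remark about loading only one file into a dictionary at a time could look roughly like the sketch below. It is untested against real files and rests on assumptions: `StreamingCompare`, `BuildIndex` and `FindMatches` are hypothetical names, the key is a (field number, value) tuple rather than the CCDDetail class above, and the regex and the two skipped lines at each end are copied from the question's update. The second file is only enumerated, with a two-line delay so its last two lines are never processed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StreamingCompare
{
    // Fixed-width layout taken from the regex in the question's update.
    private static readonly Regex Layout =
        new Regex(".{12}(.{17})(.{10})(.{8}).(.{6})(.{22})", RegexOptions.Compiled);

    // Index one file: (field number, value) -> numbers of the lines that hold it.
    public static Dictionary<(int, string), List<int>> BuildIndex(IReadOnlyList<string> lines)
    {
        var index = new Dictionary<(int, string), List<int>>();
        for (int j = 2; j < lines.Count - 2; ++j)   // skip first 2 and last 2 lines
        {
            var mat = Layout.Match(lines[j]);
            if (!mat.Success) continue;             // line too short for the layout
            for (int k = 1; k <= 5; ++k)
            {
                var key = (k, mat.Groups[k].Value);
                if (!index.TryGetValue(key, out var lineNumbers))
                    index[key] = lineNumbers = new List<int>();
                lineNumbers.Add(j);
            }
        }
        return index;
    }

    // Stream the other file and look every field up in the index.
    public static IEnumerable<(int Field, int IndexedLine, int StreamedLine)> FindMatches(
        Dictionary<(int, string), List<int>> index, IEnumerable<string> streamedLines)
    {
        // Hold back two lines so the last two lines of the stream are never processed.
        var pending = new Queue<(int, string)>();
        int next = 0;
        foreach (var text in streamedLines)
        {
            pending.Enqueue((next++, text));
            if (pending.Count <= 2) continue;
            var (j, line) = pending.Dequeue();
            if (j < 2) continue;                    // skip first 2 lines
            var mat = Layout.Match(line);
            if (!mat.Success) continue;
            for (int k = 1; k <= 5; ++k)
                if (index.TryGetValue((k, mat.Groups[k].Value), out var hits))
                    foreach (var i in hits)
                        yield return (k, i, j);
        }
    }
}
```

A call might pass `File.ReadAllLines(...)` for the indexed file and `File.ReadLines(...)`, which enumerates lazily, for the streamed one, so only one file's fields are held in a dictionary at any moment.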
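Keep in mind that my assumption is that the same field value can occur on multiple lines in any of the files being compared. Also, the LineMatch list needs post-processing, because it contains a record of all lines in the two files that share the same field value (you may also want to record which field number matched).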
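One possible shape for that post-processing is sketched here, under assumptions: `FieldLineMatch` is a hypothetical variant of LineMatch with an added `FieldId`, and `ByLinePair` is an illustrative helper, neither of which appears in the code above. It flattens the grouped records into one entry per pair of lines and lists the field numbers on which that pair agreed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical variant of LineMatch that also remembers which field matched.
public class FieldLineMatch
{
    public int FieldId { get; set; }
    public int File1Index { get; set; }
    public List<int> File1Lines { get; set; }
    public int File2Index { get; set; }
    public List<int> File2Lines { get; set; }
}

public static class MatchPostProcessing
{
    // Key: (file 1 index, line in file 1, file 2 index, line in file 2).
    // Value: the field numbers that are equal on those two lines.
    public static Dictionary<(int, int, int, int), List<int>> ByLinePair(
        IEnumerable<FieldLineMatch> matches)
    {
        var result = new Dictionary<(int, int, int, int), List<int>>();
        foreach (var m in matches)
        {
            // Every line in file 1 with this value matches every line in file 2 with it.
            foreach (var line1 in m.File1Lines)
            {
                foreach (var line2 in m.File2Lines)
                {
                    var key = (m.File1Index, line1, m.File2Index, line2);
                    if (!result.TryGetValue(key, out var fields))
                        result[key] = fields = new List<int>();
                    fields.Add(m.FieldId);
                }
            }
        }
        return result;
    }
}
```

Grouping by line pair makes it easy to see, for example, that two lines agree on both the customer name and the account number rather than reporting them as two unrelated hits.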