Comparing large quantities of numbers with C#

Question

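Comparing the sets of numbers is too slow. Is there a more efficient way to solve this problem?

I have two groups of sets. Each group has about 5 million sets, each set has 6 numbers, and each number is between 1 and 100. The sets and groups are unsorted and may contain duplicates.

Here is an example.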

No.     Group A                 Group B
1       {1,2,3,4,5,6}           {6,2,4,87,53,12}
2       {2,3,4,5,6,8}           {43,6,78,23,96,24}
3       {45,23,57,79,23,76}     {12,1,90,3,2,23}
4       {3,5,85,24,78,90}       {12,65,78,9,23,13}
        ...                     ...

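My goal is to compare the two groups and classify group A by maximum common element count, within 5 hours on my laptop.

In the example, No. 1 of group A and No. 3 of group B have 3 common elements (1, 2, 3). Also, No. 2 of group A and No. 3 of group B have 2 common elements (2, 3). Therefore, I classify group A as follows.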

No.     Group A             Maximum Common Element Count
1       {1,2,3,4,5,6}           3
2       {2,3,4,5,6,8}           3
3       {45,23,57,79,23,76}     1
4       {3,5,85,24,78,90}       2
        ...

My approach is to compare every set and every number, so the complexity is (group A count) * (group B count) * 6 * 6. It therefore takes a long time.

Dictionary<int, List<int>> Classified = new Dictionary<int, List<int>>();
foreach (List<int> setA in GroupA)
{
    int maxcount = 0;
    foreach (List<int> setB in GroupB)
    {
        int count = 0; 
        foreach(int elementA in setA)
        {
            foreach(int elementB in setB)
            {
                if (elementA == elementB) count++;
            }
        }
        if (count > maxcount) maxcount = count;
    }
    Classified.Add(maxcount, setA);
}

Answer 1

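Here is my attempt: it uses HashSet<int> and precomputes the range of each set, to avoid set-to-set comparisons for cases such as {1,2,3,4,5,6} and {7,8,9,10,11,12} (as Matt's answer points out).

For me (running with random sets) this gave a 130x speedup over the original code. You mentioned in the comments

Execution time is now over 3 days, so others have said I need to parallelize.

and in the question itself

My goal is to compare the two groups and classify group A by maximum common element count, within 5 hours on my laptop.

So, assuming the comment means that the execution time for your data is over 3 days (72 hours) but you want it to finish in 5 hours, you only need a speedup of about 14x.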

Framework

I created a few classes to run these benchmarks:

Range - takes some int values and tracks the minimum and maximum.

using System.Collections.Generic;
using System.Linq;

public class Range
{
    private readonly int _min;
    private readonly int _max;

    public Range(IReadOnlyCollection<int> values)
    {
        _min = values.Min();
        _max = values.Max();
    }

    public int Min { get { return _min; } }
    public int Max { get { return _max; } }

    // Two ranges intersect unless one lies entirely above or below the other.
    public bool Intersects(Range other)
    {
        if ( _min > other._max ) return false;
        if ( _max < other._min ) return false;
        return true;
    }
}

SetWithRange - wraps a HashSet<int> and a Range of the values.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public class SetWithRange : IEnumerable<int>
{
    private readonly HashSet<int> _values;
    private readonly Range _range;

    public SetWithRange(IReadOnlyCollection<int> values)
    {
        _values = new HashSet<int>(values);
        _range = new Range(values);
    }

    public static SetWithRange Random(Random random, int size, Range range)
    {
        var values = new HashSet<int>();

        // Random.Next(int, int) generates numbers in the range [min, max)
        // so we need to add one here to be able to generate numbers in [min, max].
        // See https://docs.microsoft.com/en-us/dotnet/api/system.random.next
        var min = range.Min;
        var max = range.Max + 1;

        while ( values.Count() < size )
            values.Add(random.Next(min, max));

        return new SetWithRange(values);
    }

    public int CommonValuesWith(SetWithRange other)
    {
        // No need to call Intersect on the sets if the ranges don't intersect
        if ( !_range.Intersects(other._range) )
            return 0;

        return _values.Intersect(other._values).Count();
    }

    public IEnumerator<int> GetEnumerator() { return _values.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}

The results were generated using SetWithRange.Random, as follows:

const int groupCount = 10000;
const int setSize = 6;
var range = new Range(new[] { 1, 100 });
var generator = new Random();

var groupA = Enumerable.Range(0, groupCount)
    .Select(i => SetWithRange.Random(generator, setSize, range))
    .ToList();
var groupB = Enumerable.Range(0, groupCount)
    .Select(i => SetWithRange.Random(generator, setSize, range))
    .ToList();

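The timings given below are averages of three runs of an x64 release build on my machine.

In all cases I generated groups of 10000 random sets, then scaled up to approximate the execution time for 5 million sets using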

timeFor5Million = timeFor10000 / 10000 / 10000 * 5000000 * 5000000
                = timeFor10000 * 250000

Results

Four foreach blocks:

Average time = 48628ms; estimated time for 5 million sets = 3377 hours

var result = new Dictionary<SetWithRange, int>();
foreach ( var setA in groupA )
{
    int maxcount = 0;
    foreach ( var setB in groupB )
    {
        int count = 0;
        foreach ( var elementA in setA )
        {
            foreach ( int elementB in setB )
            {
                if ( elementA == elementB ) count++;
            }
        }
        if ( count > maxcount ) maxcount = count;
    }
    result.Add(setA, maxcount);
}

Three foreach blocks with parallelization on the outer foreach:

Average time = 10305ms; estimated time for 5 million sets = 716 hours (4.7x faster than the original):

var result = new Dictionary<SetWithRange, int>();
Parallel.ForEach(groupA, setA =>
{
    int maxcount = 0;
    foreach ( var setB in groupB )
    {
        int count = 0;
        foreach ( var elementA in setA )
        {
            foreach ( int elementB in setB )
            {
                if ( elementA == elementB ) count++;
            }
        }
        if ( count > maxcount ) maxcount = count;
    }
    // Dictionary is not thread-safe, so writes are serialized.
    lock ( result ) result.Add(setA, maxcount);
});

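Using HashSet<int> and adding a Range so that only intersecting sets are checked:

Average time = 375ms; estimated time for 5 million sets = 24 hours (130x faster than the original):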
```
var result = new Dictionary<SetWithRange, int>();
Parallel.ForEach(groupA, setA =>
{
    var commonValues = groupB.Max(setB => setA.CommonValuesWith(setB));
    lock ( result ) result.Add(setA, commonValues);
});
```
Link to an online demo: https://dotnetfiddle.net/Kxpagh (note that .NET Fiddle limits execution time to 10 seconds and that, for obvious reasons, it runs slower than a normal environment).

Answer 2

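The fastest approach I can think of is this:

Since all your numbers come from a limited range (1-100), you can represent each set as a 100-bit binary number <d1,d2,...,d100>, where dn equals 1 if n is in the set.

Comparing two sets then means taking the bitwise AND of the two binary representations and counting the set bits (which can be done efficiently).

In addition, this task can be parallelized (your input is immutable, so that is quite straightforward).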
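The idea above could be sketched roughly as follows. This is only an illustration, not code from the answer: the names BitSet100 and BitSetComparer are made up for the example, the 100 bits are split across two ulong fields, and it assumes a runtime where System.Numerics.BitOperations.PopCount is available (.NET Core 3.0 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Threading.Tasks;

// Hypothetical helper (not from the answer): a set of values in 1..100
// packed into two 64-bit words.
public readonly struct BitSet100
{
    private readonly ulong _lo; // bit k set => value k + 1 is present (values 1..64)
    private readonly ulong _hi; // bit k set => value k + 65 is present (values 65..100)

    public BitSet100(IEnumerable<int> values)
    {
        ulong lo = 0, hi = 0;
        foreach (int v in values)
        {
            if (v < 1 || v > 100)
                throw new ArgumentOutOfRangeException(nameof(values), "values must be in 1..100");
            int bit = v - 1;
            if (bit < 64) lo |= 1UL << bit;
            else hi |= 1UL << (bit - 64);
        }
        _lo = lo;
        _hi = hi;
    }

    // Bitwise AND keeps only the shared values; PopCount counts them.
    public int CommonCount(BitSet100 other)
    {
        return BitOperations.PopCount(_lo & other._lo)
             + BitOperations.PopCount(_hi & other._hi);
    }
}

public static class BitSetComparer
{
    // For each set in groupA, returns the largest overlap with any set in groupB.
    public static int[] MaxCommonCounts(IReadOnlyList<BitSet100> groupA, IReadOnlyList<BitSet100> groupB)
    {
        var result = new int[groupA.Count];
        // Each iteration writes only its own slot, so no lock is needed.
        Parallel.For(0, groupA.Count, i =>
        {
            int max = 0;
            for (int j = 0; j < groupB.Count; j++)
            {
                int c = groupA[i].CommonCount(groupB[j]);
                if (c > max) max = c;
            }
            result[i] = max;
        });
        return result;
    }
}
```

Note that a bitmask collapses repeated numbers, so a set such as {45,23,57,79,23,76} is treated as five distinct values. Writing results into an array indexed by position also avoids the lock around a shared dictionary.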

Answer 3

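You will have to benchmark this with smaller sets, but since you have to make 5E6 * 5E6 = 25E12 comparisons, you may as well sort the contents of the 5E6 + 5E6 = 10E6 sets first.

Set-to-set comparison then becomes much faster, because you can stop each comparison as soon as you reach the highest number on the first side. The saving per comparison is tiny, but it adds up over trillions of comparisons.

You could go a step further and index the two groups of 5 million by lowest entry and highest entry. That reduces the number of comparisons significantly further. In the end, that is only 100 * 100 = 10,000 = 1E4 distinct combinations. You never have to compare a set whose highest number is 12 with any set that starts at 13 or more, which avoids a great deal of work.

As I see it, this is a lot of data to sort, but that pales next to the number of raw set-to-set comparisons you would otherwise have to make. Here you eliminate the work for all the pairs with zero overlap, and when you do compare, you can abort early if the conditions are right.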
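A minimal sketch of the sorted comparison with early exit, under stated assumptions: the class name SortedSetCompare and its methods are made up for illustration, each set is stored as an int[], and duplicates are removed during the one-off sort. The 100 * 100 index by lowest and highest entry is not shown; only the per-pair range check is.

```csharp
using System;

public static class SortedSetCompare
{
    // Returns a sorted copy with duplicates removed; done once per set up front.
    public static int[] Normalize(int[] values)
    {
        var copy = (int[])values.Clone();
        Array.Sort(copy);
        int n = 0;
        for (int i = 0; i < copy.Length; i++)
        {
            if (i == 0 || copy[i] != copy[i - 1]) copy[n++] = copy[i];
        }
        Array.Resize(ref copy, n);
        return copy;
    }

    // Assumes a and b are sorted ascending without duplicates.
    public static int CommonCount(int[] a, int[] b)
    {
        if (a.Length == 0 || b.Length == 0) return 0;

        // If the value ranges do not overlap, nothing can match.
        if (a[a.Length - 1] < b[0] || b[b.Length - 1] < a[0]) return 0;

        // Merge-style walk: stops as soon as either side is exhausted.
        int i = 0, j = 0, count = 0;
        while (i < a.Length && j < b.Length)
        {
            if (a[i] == b[j]) { count++; i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return count;
    }
}
```

Each pair then costs at most about 12 steps instead of 36 element comparisons, and pairs with disjoint ranges cost a single check.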

As others have said, parallelize...

PS: 5E6 = 5 * 10^6 = 5,000,000 and 25E12 = 25 * 10^12 = 25 * 1,000,000,000,000

Answer 4

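The time complexity of any algorithm you come up with will be the same. HashSets may be a bit faster, but if so, not by much: the overhead of 36 direct list comparisons versus 12 hash set lookups is not going to be significantly greater, if at all, but you would have to benchmark it. Pre-sorting may help, considering that each set will be compared millions of times. FYI, for loops are faster than foreach loops on a List, and arrays are faster than Lists (for and foreach on arrays perform the same), which may make a decent performance difference for something like this. If the No. column is sequential, I would also use an array instead of a dictionary. Array lookups are an order of magnitude faster than dictionary lookups.

I think you are generally doing this about as fast as it can be done, apart from parallelization, with a few small gains available from the micro-optimizations above.

How far is the current algorithm from your target execution time?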

Answer 5

I would use the following:

foreach (List<int> setA in GroupA)
{
    int maxcount = GroupB.Max(x => x.Sum(y => setA.Contains(y) ? 1 : 0));
    Classified.Add(maxcount, setA);
}

5 answers

Answer 1: score 4, 2020-01-03 14:16:49
Answer 2: score 2, accepted, 2020-01-03 01:47:16
Answer 3: score 1, 2020-01-03 01:57:59
Answer 4: score 1, 2020-01-03 02:17:28
Answer 5: score -1, 2020-01-03 01:58:16
