Finding the differences between 2 collections of 40K objects

Question

I have 2 collections that contain the same type of object, and each collection holds approximately 40K objects.

The object each collection contains is basically like a dictionary, except that I've overridden the Equals and GetHashCode methods:

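using System;

public class MyClass : IEquatable<MyClass>
{
    public int ID { get; set; }
    public string Name { get; set; }

    public override bool Equals(object obj)
    {
        return obj is MyClass && this.Equals((MyClass)obj);
    }

    public bool Equals(MyClass ot)
    {
        // Guard against null before touching ot's members.
        if (ReferenceEquals(ot, null))
        {
            return false;
        }

        if (ReferenceEquals(this, ot))
        {
            return true;
        }

        // Equal when the IDs match and the names match ignoring case.
        return
            ot.ID.Equals(this.ID) &&
            string.Equals(ot.Name, this.Name, StringComparison.OrdinalIgnoreCase);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            int result = this.ID.GetHashCode();
            // GetSafeHashCode is the asker's own extension method (not shown).
            // It has to ignore case to stay consistent with Equals above.
            result = (result * 397) ^ this.Name.GetSafeHashCode();
            return result;
        }
    }
}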

The code I'm using to compare the collections and get the differences is just a simple LINQ query using PLINQ:

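ParallelQuery<MyClass> p1Coll = sourceColl.AsParallel();
ParallelQuery<MyClass> p2Coll = destColl.AsParallel();

// For every item in p2Coll this scans p1Coll for a match: O(n * m) comparisons.
List<MyClass> diffs = p2Coll.Where(r => !p1Coll.Any(m => m.Equals(r))).ToList();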

Does anybody know of a faster way of comparing this many objects? Currently it takes about 40 seconds (+/- 2 seconds) on a quad-core computer. Would grouping the data first and then comparing each group in parallel possibly be faster? If I group the data by Name first I end up with about 490 distinct groups, and if I group it by ID first I end up with about 622 distinct groups.

Answer 1

You can use the Except method, which gives you every item from p2Coll that is not in p1Coll.

var diff = p2Coll.Except(p1Coll);
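
As a rough illustration of why this helps: Except builds a hash set from the second sequence and probes it once per element of the first, so the work grows with the combined size of the collections rather than with their product. The sketch below is self-contained; `Item` and `DiffSketch` are hypothetical stand-ins for the asker's types, not code from the question. It assumes the hash code agrees with Equals (both ignore case in Name), because otherwise Except can miss matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for MyClass from the question.
public sealed class Item : IEquatable<Item>
{
    public int ID { get; set; }
    public string Name { get; set; }

    public bool Equals(Item other)
    {
        if (ReferenceEquals(other, null)) return false;
        return ID == other.ID &&
               string.Equals(Name, other.Name, StringComparison.OrdinalIgnoreCase);
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as Item);
    }

    // Must agree with Equals: names that differ only by case hash the same.
    public override int GetHashCode()
    {
        unchecked
        {
            int nameHash = Name == null
                ? 0
                : StringComparer.OrdinalIgnoreCase.GetHashCode(Name);
            return (ID * 397) ^ nameHash;
        }
    }
}

public static class DiffSketch
{
    // Items present in dest but missing from source.
    public static List<Item> MissingFromSource(IEnumerable<Item> source, IEnumerable<Item> dest)
    {
        return dest.Except(source).ToList();
    }
}
```

Note that Except has set semantics: if dest contains duplicates of an item that is missing from source, the result contains that item only once.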

UPDATE (some performance testing):

Disclaimer:

Actual time depends on multiple factors (such as the contents of the collections, hardware, what else is running on your machine, the number of hash code collisions, etc.), which is why we have complexity analysis and Big O notation (see Daniel Brückner's comment).

Here are some performance stats for 10 runs on my 4-year-old machine:

Median time for Any(): 6973.97658 ms
Median time for Except(): 9.23025 ms

Source code for my test is available on gist.
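
The gist itself is not reproduced here. The following is only a sketch of how such a comparison could be timed, not the author's actual test: `TimingSketch`, its method names, and the use of plain `int` lists with half-overlapping ranges are all assumptions made for illustration, and it uses sequential LINQ rather than PLINQ. Absolute numbers will differ from the ones above.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class TimingSketch
{
    // Quadratic approach from the question: scan source for every dest element.
    public static List<int> DiffWithAny(List<int> source, List<int> dest)
    {
        return dest.Where(d => !source.Any(s => s.Equals(d))).ToList();
    }

    // Hash-based approach: build a set from source, then probe it per dest element.
    public static List<int> DiffWithExcept(List<int> source, List<int> dest)
    {
        return dest.Except(source).ToList();
    }

    // Median wall-clock time in milliseconds over the given number of runs.
    public static double MedianMs(Action action, int runs)
    {
        var times = new List<double>();
        for (int i = 0; i < runs; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            times.Add(sw.Elapsed.TotalMilliseconds);
        }
        times.Sort();
        int mid = times.Count / 2;
        return times.Count % 2 == 1
            ? times[mid]
            : (times[mid - 1] + times[mid]) / 2.0;
    }

    // Assumed data shape: two lists of 'size' ints, half of them overlapping.
    public static string Report(int size, int runs)
    {
        List<int> source = Enumerable.Range(0, size).ToList();
        List<int> dest = Enumerable.Range(size / 2, size).ToList();
        double anyMs = MedianMs(() => DiffWithAny(source, dest), runs);
        double exceptMs = MedianMs(() => DiffWithExcept(source, dest), runs);
        return string.Format("Any(): {0} ms, Except(): {1} ms", anyMs, exceptMs);
    }
}
```

Calling `TimingSketch.Report(40000, 10)` would mirror the collection size and run count mentioned above, though the quadratic variant may take a while to finish.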

UPDATE 2:

If you want the items that differ in either direction (in the first collection but not the second, and vice versa), you have to call Except in both directions and then Union the results:

var diff = p2Coll.Except(p1Coll).Union(p1Coll.Except(p2Coll));
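
If the collections do not need to stay as PLINQ queries, one alternative is `HashSet<T>.SymmetricExceptWith`, which computes the same "in one but not both" result in place. This is a sketch of a different technique from the one-liner above, and `SymmetricDiffSketch` is a hypothetical helper name. As with Except, it relies on Equals and GetHashCode being consistent for the element type.

```csharp
using System.Collections.Generic;

public static class SymmetricDiffSketch
{
    // Returns the elements that appear in exactly one of the two sequences.
    public static HashSet<T> SymmetricDifference<T>(IEnumerable<T> first, IEnumerable<T> second)
    {
        // Copy first into a set, then toggle membership using second.
        var result = new HashSet<T>(first);
        result.SymmetricExceptWith(second);
        return result;
    }
}
```

For example, `SymmetricDiffSketch.SymmetricDifference(sourceColl, destColl)` would return the items that are in only one of the two collections from the question.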

Answer 2

Intersect

int[] id1 = { 44, 26, 92, 30, 71, 38 };
int[] id2 = { 39, 59, 83, 47, 26, 4, 30 };

IEnumerable<int> both = id1.Intersect(id2);

foreach (int id in both)
    Console.WriteLine(id);

/*
This code produces the following output:

26
30
*/

Answer 1
Score 15, accepted, 2013-01-08 18:21:58

Answer 2
Score 0, 2013-01-08 18:24:01