Fast intersection of HashSet<int> and List<int>

I have a HashSet<int> and a List<int> (the HashSet has approximately 3 million items, the List approximately 300k items).

I currently intersect them using

var intersected = hashset.Intersect(list).ToArray();

and I wonder if there is any faster way to do so. Maybe in parallel?

HashSet has a method IntersectWith that is optimized when the intersection is performed between two hash sets. Using IntersectWith, we can intersect a HashSet and a List with the following approach:

private static IEnumerable<int> Intersect(HashSet<int> hash, List<int> list)
{
    HashSet<int> intersect = new HashSet<int>(list);
    intersect.IntersectWith(hash);
    return intersect;
}
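
As a small usage sketch of the method above (the class name IntersectWithDemo and the sample values are made up for illustration): IntersectWith modifies the set it is called on, which is why the list is first copied into a new HashSet. A side effect of going through a set is that duplicates in the list collapse.

```csharp
using System;
using System.Collections.Generic;

public static class IntersectWithDemo
{
    // Copies the list into a new set so that neither input is modified;
    // IntersectWith mutates the set it is called on.
    public static HashSet<int> Intersect(HashSet<int> hash, List<int> list)
    {
        var result = new HashSet<int>(list);
        result.IntersectWith(hash);
        return result;
    }

    public static void Main()
    {
        var hash = new HashSet<int> { 1, 2, 3, 4, 5 };
        var list = new List<int> { 4, 5, 6, 5 };

        HashSet<int> common = Intersect(hash, list);

        Console.WriteLine(common.Count); // 2: the elements 4 and 5
        Console.WriteLine(hash.Count);   // 5: the original set is untouched
    }
}
```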

I have measured (using Stopwatch) the performance of your original method (Linq Intersect), the methods proposed by @TheodorZoulias (HashSet Contains and HashSet Contains Parallel) and my method (HashSet IntersectWith). Here are the results:

------------------------------------------------------------------------
|         Method            | Min, ms | Max, ms | Avg, ms | StdDev, ms |
------------------------------------------------------------------------
| Linq Intersect            |   135   |   274   |   150   |     17     |
| HashSet Contains          |    25   |    44   |    26   |      2     |
| HashSet Contains Parallel |    12   |    53   |    13   |      3     |
| HashSet IntersectWith     |    57   |    89   |    61   |      4     |
------------------------------------------------------------------------

From the table we can see that the fastest method is HashSet Contains Parallel and the slowest is Linq Intersect.


Here is the complete source code that was used to measure performance.
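
For readers who want to try a comparison themselves, a measurement along these lines could look roughly like the sketch below. This is not the benchmark code referred to above; the helper Measure, the run count of 20, the random seed, and the way the test data is generated are all assumptions made for illustration, and the timings it prints will differ from the table.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class IntersectTiming
{
    // Hypothetical helper: runs the action several times and returns
    // the elapsed milliseconds of each run.
    public static List<double> Measure(Action action, int runs)
    {
        var timings = new List<double>(runs);
        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            timings.Add(sw.Elapsed.TotalMilliseconds);
        }
        return timings;
    }

    public static void Main()
    {
        // Assumed test data: random non-negative ints with a fixed seed.
        var random = new Random(42);
        var hashset = new HashSet<int>();
        while (hashset.Count < 3_000_000) hashset.Add(random.Next());
        var list = new List<int>(300_000);
        for (int i = 0; i < 300_000; i++) list.Add(random.Next());

        var linq = Measure(() => hashset.Intersect(list).ToArray(), 20);
        var contains = Measure(() => list.Where(x => hashset.Contains(x)).ToArray(), 20);

        Console.WriteLine($"Linq Intersect:   min {linq.Min():F0} ms, avg {linq.Average():F0} ms");
        Console.WriteLine($"HashSet Contains: min {contains.Min():F0} ms, avg {contains.Average():F0} ms");
    }
}
```

Stopwatch timings like these include JIT and garbage-collection noise, so the first run is usually an outlier; a dedicated tool such as BenchmarkDotNet is the more rigorous option.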

Yes, you can go faster because you already have a HashSet in hand. The LINQ Intersect uses a generic algorithm that essentially recreates a HashSet from scratch every time it's called. Here is a faster algorithm:

/// <summary>Yields all the elements of first (including duplicates) that also
/// appear in second, in the order in which they appear in first.</summary>
public static IEnumerable<TSource> Intersect<TSource>(IEnumerable<TSource> first,
    HashSet<TSource> second)
{
    foreach (TSource element in first)
    {
        if (second.Contains(element)) yield return element;
    }
}
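
A short usage sketch of this method follows (the containing class name SetHelpers and the sample values are hypothetical). It is declared as a plain static method rather than an extension method here, so it cannot be confused with Enumerable.Intersect. Note that, unlike the LINQ operator, it keeps duplicates from the first sequence.

```csharp
using System;
using System.Collections.Generic;

public static class SetHelpers
{
    // Yields every element of first (duplicates included) that is in second,
    // in the order of first. One hash lookup per element, no new set built.
    public static IEnumerable<TSource> Intersect<TSource>(
        IEnumerable<TSource> first, HashSet<TSource> second)
    {
        foreach (TSource element in first)
        {
            if (second.Contains(element)) yield return element;
        }
    }

    public static void Main()
    {
        var list = new List<int> { 5, 1, 5, 9 };
        var hashset = new HashSet<int> { 1, 5, 7 };

        IEnumerable<int> common = Intersect(list, hashset);

        Console.WriteLine(string.Join(", ", common)); // 5, 1, 5
    }
}
```

If distinct results are needed, appending Distinct() from System.Linq to the result would remove the repeated 5, at the cost of building another internal set.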

Update: Here is a parallel version of the above idea:

var intersected = list.AsParallel().Where(x => hashset.Contains(x)).ToArray();

I wouldn't expect it to be much faster, if at all, because the workload is too granular. The overhead of calling a lambda 300,000 times will probably overshadow any benefits of the parallelism.

Also, the order of the results will not be preserved unless the AsOrdered PLINQ method is added to the query, which further hurts the performance of the operation.
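
For illustration, the order-preserving variant could be written as in the sketch below. The class and method names are made up, and it assumes the HashSet is not modified by any thread while the query runs, since only concurrent reads are performed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelIntersect
{
    // Parallel lookup that keeps the results in the order of the list.
    // Assumes hashset is not mutated while the query is executing.
    public static int[] IntersectOrdered(List<int> list, HashSet<int> hashset)
    {
        return list.AsParallel()
                   .AsOrdered()
                   .Where(x => hashset.Contains(x))
                   .ToArray();
    }

    public static void Main()
    {
        var list = Enumerable.Range(0, 1000).ToList();
        var hashset = new HashSet<int>(Enumerable.Range(0, 500).Select(i => i * 2));

        int[] common = IntersectOrdered(list, hashset);

        Console.WriteLine(common.Length);                        // 500
        Console.WriteLine(string.Join(", ", common.Take(3)));    // 0, 2, 4
    }
}
```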

It might be faster for you to store lots of integers as a compact bit set rather than as a HashSet or List (at least if you're using the List to store unique integers, just like the HashSet). In this sense, there are several choices:

  • The built-in BitArray stores each bit in a compact way. As an example, if you're storing integers from 1 through 65000, BitArray requires about 8125 bytes of memory (as opposed to 65000 bytes if each bit were stored as an 8-bit byte). However, BitArray may not be very memory-efficient if the highest set bit is very large (e.g., 3 billion), or if the set of bits is sparse (there are huge areas with set bits and/or clear bits). You can intersect two BitArrays using the And method.
  • Compressed bit sets likewise store each bit in a compact way, but also compress parts of themselves to further save memory while still keeping set operations such as intersection efficient. Examples include Elias-Fano encoding, Roaring Bitmaps, and EWAH. See graphs comparing different implementations of compressed bit sets with uncompressed (FixedBitSet) in terms of performance and memory (note that they compare Java implementations, but they may still be useful in the .NET case).
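
As a minimal sketch of the BitArray option (the helper names ToBits and Intersect are hypothetical, and it assumes all values are non-negative and smaller than a known upper bound): each integer becomes the index of a set bit, and the bitwise And of two arrays of equal length leaves exactly the bits present in both.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class BitArrayIntersect
{
    // Sets bit v for every value v. Assumes 0 <= v < length.
    public static BitArray ToBits(IEnumerable<int> values, int length)
    {
        var bits = new BitArray(length);
        foreach (int v in values) bits[v] = true;
        return bits;
    }

    // Returns the indices set in both arrays. Both must have the same Length.
    public static List<int> Intersect(BitArray a, BitArray b)
    {
        // And mutates the instance it is called on, so work on a copy of a.
        BitArray both = new BitArray(a).And(b);
        var result = new List<int>();
        for (int i = 0; i < both.Length; i++)
        {
            if (both[i]) result.Add(i);
        }
        return result;
    }

    public static void Main()
    {
        BitArray a = ToBits(new[] { 1, 3, 5, 7 }, 16);
        BitArray b = ToBits(new[] { 3, 4, 5, 6 }, 16);

        Console.WriteLine(string.Join(", ", Intersect(a, b))); // 3, 5
    }
}
```

Whether this beats a HashSet lookup depends on the value range: the And itself is cheap, but scanning the result costs time proportional to the array length rather than to the number of elements.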
