Fast intersection of HashSet<int> and List<int>
I have a HashSet<int> and a List<int> (the HashSet has approximately 3 million items, the List approximately 300k items).
I currently intersect them using
var intersected = hashset.Intersect(list).ToArray();
and I wonder if there is any faster way to do so. Maybe in parallel?
HashSet has a method IntersectWith that is optimized if the intersection is performed between two hash sets. Using IntersectWith, we can intersect a HashSet and a List with the following approach:
private static IEnumerable<int> Intersect(HashSet<int> hash, List<int> list)
{
    HashSet<int> intersect = new HashSet<int>(list);
    intersect.IntersectWith(hash);
    return intersect;
}
I have measured (using Stopwatch) the performance of your original method (Linq Intersect), the methods proposed by @TheodorZoulias (HashSet Contains and HashSet Contains Parallel) and my method (HashSet IntersectWith). Here are the results:
------------------------------------------------------------------------
| Method | Min, ms | Max, ms | Avg, ms | StdDev, ms |
------------------------------------------------------------------------
| Linq Intersect | 135 | 274 | 150 | 17 |
| HashSet Contains | 25 | 44 | 26 | 2 |
| HashSet Contains Parallel | 12 | 53 | 13 | 3 |
| HashSet IntersectWith | 57 | 89 | 61 | 4 |
------------------------------------------------------------------------
From the table we can see that the fastest method is HashSet Contains Parallel and the slowest is Linq Intersect.
Here is the complete source code that was used to measure performance.
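The linked source is not reproduced here, so the following is only a minimal sketch of how such a Stopwatch measurement could be set up. The class name IntersectTiming, the helpers AverageMs and CreateData, the seed and the value range are all hypothetical and are not taken from the original benchmark.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Hypothetical timing helper, not the benchmark used for the table above.
// Possible usage:
//   var data = IntersectTiming.CreateData(3_000_000, 300_000, 42);
//   double ms = IntersectTiming.AverageMs(
//       () => data.Set.Intersect(data.List).ToArray(), 10);
public static class IntersectTiming
{
    // Runs the action 'runs' times and returns the average elapsed milliseconds.
    public static double AverageMs(Action action, int runs)
    {
        action(); // warm-up run, so JIT compilation is not counted
        var stopwatch = new Stopwatch();
        double total = 0;
        for (int i = 0; i < runs; i++)
        {
            stopwatch.Restart();
            action();
            stopwatch.Stop();
            total += stopwatch.Elapsed.TotalMilliseconds;
        }
        return total / runs;
    }

    // Builds a set holding 0..setSize-1 and a list of random values
    // (assumed range: 0..2*setSize-1, so roughly half of them are in the set).
    public static (HashSet<int> Set, List<int> List) CreateData(
        int setSize, int listSize, int seed)
    {
        var random = new Random(seed);
        var set = new HashSet<int>(Enumerable.Range(0, setSize));
        var list = new List<int>(listSize);
        for (int i = 0; i < listSize; i++)
        {
            list.Add(random.Next(0, setSize * 2));
        }
        return (set, list);
    }
}
```

Absolute numbers from a sketch like this will depend on the machine and on how much the two collections overlap, so they are not expected to match the table.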
Yes, you can go faster because you already have a HashSet in hand. The LINQ Intersect uses a generic algorithm that essentially recreates a HashSet from scratch every time it's called. Here is a faster algorithm:
/// <summary>Yields all the elements of first (including duplicates) that also
/// appear in second, in the order in which they appear in first.</summary>
public static IEnumerable<TSource> Intersect<TSource>(IEnumerable<TSource> first,
    HashSet<TSource> second)
{
    foreach (TSource element in first)
    {
        if (second.Contains(element)) yield return element;
    }
}
Update: Here is a parallel version of the above idea:
var intersected = list.AsParallel().Where(x => hashset.Contains(x)).ToArray();
I wouldn't expect it to be much faster, if at all, because the workload is too granular. The overhead of calling a lambda 300,000 times will probably overshadow any benefits of the parallelism.
Also, the order of the results will not be preserved unless the AsOrdered PLINQ method is added to the query, which further hurts the performance of the operation.
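For reference, the order-preserving variant mentioned above might look like the following sketch. The class and method names (ParallelIntersect, IntersectOrdered) are made up for illustration, and the code assumes the HashSet is not modified while the query runs.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ParallelIntersect
{
    // Returns the elements of list (including duplicates) that are also in
    // hashset, in their original list order. AsOrdered asks PLINQ to preserve
    // the source order, at some extra cost.
    public static int[] IntersectOrdered(List<int> list, HashSet<int> hashset)
    {
        return list.AsParallel()
                   .AsOrdered()
                   .Where(x => hashset.Contains(x))
                   .ToArray();
    }
}
```

Whether the ordered query is still faster than the sequential loop is something to measure on the actual data rather than assume.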
It might be faster for you to store lots of integers as a compact bit set rather than as a HashSet or List (at least if you're using List to store unique integers just like HashSet). In this sense, there are several choices:
The built-in BitArray stores each bit in a compact way. As an example, if you're storing integers from 1 through 65000, BitArray requires about 8125 bytes of memory (as opposed to 65000 bytes if each bit were stored as an 8-bit byte). However, BitArray may not be very memory-efficient if the highest set bit is very large (e.g., 3 billion), or if the set of bits is sparse (there are huge areas with set bits and/or clear bits). You can intersect two BitArrays using the And method, which keeps only the bits that are set in both.
See the charts comparing different implementations of compressed bit sets with an uncompressed one (FixedBitSet) in terms of performance and memory (note that they compare Java implementations, but they may still be useful in the .NET case).
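As a rough illustration of the BitArray option, the sketch below intersects two collections of non-negative integers with a bitwise AND. The class and method names are hypothetical, and it assumes every value lies in the range [0, length), where length is chosen by the caller.

```csharp
using System.Collections;
using System.Collections.Generic;

public static class BitArrayIntersect
{
    // Builds a BitArray of the given length with one bit set per value.
    // Assumes every value is in the range [0, length).
    public static BitArray ToBitArray(IEnumerable<int> values, int length)
    {
        var bits = new BitArray(length);
        foreach (int value in values)
        {
            bits[value] = true;
        }
        return bits;
    }

    // Returns the values present in both inputs, in ascending order.
    public static List<int> Intersect(
        IEnumerable<int> first, IEnumerable<int> second, int length)
    {
        BitArray a = ToBitArray(first, length);
        BitArray b = ToBitArray(second, length);
        a.And(b); // modifies a in place; both arrays must have the same length

        var result = new List<int>();
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i]) result.Add(i);
        }
        return result;
    }
}
```

In practice the bit arrays would be built once and reused. Rebuilding them for every intersection, as this sketch does for brevity, would likely cancel out the benefit.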