简体   繁体   中英

Intersection of two sets in most optimized way

Given two sets of values, I have to find whether there is any common element among them or not ie whether their intersection is null or not.

Which of the standard C# collection will suit best (in terms of performance) for this purpose? I know that linq has a Intersect extension method to find out the intersection of two list/arrays but my focus is on performance in terms of Big-O notation .

And what if I have to find out the intersection of two sets as well?

Well, if you use LINQ's Intersect method it will build up a HashSet of the second sequence, and then check each element of the first sequence against it. So it's O(M+N)... and you can use foo.Intersect(bar).Any() to get an early-out.

Of course, if you store one (either) set in a HashSet<T> to start with, you can just iterate over the other one checking for containment on each step. You'd still need to build the set to start with though.

Fundamentally you've got an O(M+N) problem whatever you do - you're not going to get cheaper than that (there's always the possibility that you'll have to look at every element) and if your hash codes are reasonable, you should be able to achieve that complexity easily. Of course, some solutions may give better constant factors than others... but that's performance rather than complexity ;)

EDIT: As noted in the comments, there's also ISet<T>.Overlaps - if you've already got either set with a static type of ISet<T> or a concrete implementation, calling Overlaps makes it clearer what you're doing. If both of your sets are statically typed as ISet<T> , use larger.Overlaps(smaller) (where larger and smaller are in terms of the size of the set) as I'd expect an implementation of Overlaps to iterate over the argument and check each element against contents of the set you call it on.

As mentioned , Applying Any() will give you some performance.

I tested it on pretty big dataset and it gave 25% improvements.

Also applying larger.Intersect(smaller) rather than the opposite is very important, in my case, it gave 35% improvements.

Also ordering the list before applying intersect gave another 7-8%.

Another thing to keep in mind that depending on the use case you can completely avoid applying intersect.

For example, for an integer list, if the maximum and minimum are not within the same bounders you don't need to apply intersect since they will never do.

The same goes for a string list with the same idea applied to first letter.

Again depending on your case, try as much as possible to find a rule where intersection is impossible to avoid calling it.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM