Unordered set intersection in C++

Here is my code. Any ideas on how to make it faster? My implementation is brute force: for each element in a, check whether it is also in b, and if so, put it in the result set c. Any smarter ideas are appreciated.

#include <cstdio>
#include <unordered_set>

int main() {
    std::unordered_set<int> a = {1, 2, 3, 4, 5};
    std::unordered_set<int> b = {3, 4, 5, 6, 7};
    std::unordered_set<int> c;
    // Keep every element of a that is also present in b.
    for (auto i = a.begin(); i != a.end(); ++i) {
        if (b.find(*i) != b.end()) c.insert(*i);
    }
    // Print the intersection (iteration order is unspecified).
    for (int v : c) {
        std::printf("%d \n", v);
    }
}

Asymptotically, your algorithm is as good as it can get.

In practice, I'd add a check to loop over the smaller of the two sets and do lookups in the larger one. Assuming reasonably evenly distributed hashes, a lookup in a std::unordered_set takes constant time on average. So this way, you'll be performing fewer such lookups.
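As a minimal sketch of that suggestion (the function name intersect and the int element type are assumptions for illustration, not part of the original answer), the size check could look roughly like this:

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical helper: intersect two hash sets by iterating over the
// smaller one and probing the larger one.
std::unordered_set<int> intersect(const std::unordered_set<int>& a,
                                  const std::unordered_set<int>& b) {
    // Iterate the smaller set so that only min(|a|, |b|) lookups happen.
    const std::unordered_set<int>& small = (a.size() <= b.size()) ? a : b;
    const std::unordered_set<int>& large = (a.size() <= b.size()) ? b : a;

    std::unordered_set<int> result;
    for (int v : small) {
        if (large.count(v) > 0) {
            result.insert(v);
        }
    }
    // Sanity check: an intersection can never exceed the smaller input.
    assert(result.size() <= small.size());
    return result;
}
```

Calling intersect(a, b) and intersect(b, a) should give the same set; only the number of lookups performed differs.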

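You can use std::copy_if() (from <algorithm>, with std::inserter from <iterator>). Capture b by reference so the set is not copied into the lambda:

std::copy_if(a.begin(), a.end(), std::inserter(c, c.begin()), [&b](const int element){return b.count(element) > 0;} );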

Your algorithm is as good as it gets for an unordered set. However, if you use a std::set (which uses a binary tree as storage) or, even better, a sorted std::vector, you can do better. The algorithm should be something like:

  1. Get iterators to a.begin() and b.begin().
  2. If the iterators point to equal elements, add the element to the intersection and increment both iterators.
  3. Otherwise, increment the iterator pointing to the smaller value.
  4. Go to 2, stopping once either iterator reaches the end of its container.

Both approaches should take O(n) time, but using an ordered set saves you from calculating hashes and from any performance degradation caused by hash collisions.
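The merge steps listed above could be sketched as follows for sorted vectors. The name intersect_sorted is hypothetical, and the sketch assumes both inputs are already sorted in ascending order; the standard library's std::set_intersection in <algorithm> performs the same kind of merge.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: intersection of two sorted vectors by a linear merge.
// Assumes both inputs are sorted in ascending order.
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    std::vector<int> result;
    auto i = a.begin();
    auto j = b.begin();
    // Stop as soon as either range is exhausted.
    while (i != a.end() && j != b.end()) {
        if (*i < *j) {
            ++i;                    // a's element is smaller: advance a
        } else if (*j < *i) {
            ++j;                    // b's element is smaller: advance b
        } else {
            result.push_back(*i);   // equal: part of the intersection
            ++i;
            ++j;
        }
    }
    return result;
}
```

The same loop works on std::set iterators, since a std::set is also traversed in sorted order.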

Thanks Angew. Why is your method faster? Could you elaborate a bit more?

Well, let me provide some additional info...

It should be pretty clear that, whichever data structures you use, you will have to iterate over all elements in at least one of them, so you cannot do better than O(n), n being the number of elements in the data structure you choose to iterate over. What matters now is how fast you can look up the elements in the other structure. With a hash set, which std::unordered_set actually is, a lookup is O(1), at least if the number of collisions is small enough ("reasonably evenly distributed hashes"); the degenerate case would be all values having the same hash...
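To make the degenerate case concrete, here is a small sketch with a deliberately bad hash. ConstantHash and largest_bucket are hypothetical names invented for this illustration; the point is only that when every key hashes to the same value, all elements land in one bucket and a lookup has to walk through them linearly.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Deliberately bad hash (illustration only): every key gets hash code 0.
struct ConstantHash {
    std::size_t operator()(int) const noexcept { return 0; }
};

// Hypothetical helper: size of the fullest bucket in a hash set.
// With ConstantHash this equals the number of elements in the set.
template <class Set>
std::size_t largest_bucket(const Set& s) {
    std::size_t largest = 0;
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        if (s.bucket_size(b) > largest) {
            largest = s.bucket_size(b);
        }
    }
    // Sanity check: no bucket can hold more elements than the set itself.
    assert(largest <= s.size());
    return largest;
}
```

For a std::unordered_set<int, ConstantHash> filled with 1000 values, largest_bucket should report 1000, whereas a set using a well-distributed hash would be expected to keep its buckets short.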

So far, you get O(n) * O(1) = O(n). But you still have a choice: O(n) or O(m), where m is the number of elements in the other set. In complexity terms this is the same, since the algorithm is linear either way. In practice, though, you can spare some hash calculations and lookups if you iterate over the set with fewer elements...
