Efficient set intersection of a collection of sets in C++
I have a collection of std::set. I want to find the intersection of all the sets in this collection as fast as possible. The number of sets in the collection is typically very small (~5-10), and the number of elements in each set is usually less than 1000, but can occasionally go up to around 10000. However, I need to do these intersections tens of thousands of times, so each one has to be fast. I tried to benchmark a few methods, as follows:
1. In-place intersection in a std::set object which initially copies the first set. Then, for each subsequent set, it iterates over all elements of itself and of the ith set of the collection, and removes items from itself as needed.
2. Using std::set_intersection into a temporary std::set, swapping its contents into a current set, then again finding the intersection of the current set with the next set and inserting into the temporary set, and so on.
3. Manually iterating over all the elements of all sets as in 1), but using a vector as the destination container instead of std::set.
4. Same as in 3), but using a std::list instead of a vector, suspecting that a list will provide faster deletions from the middle.
5. Using hash sets (std::unordered_set) and checking for all items in all sets.

As it turned out, using a vector is marginally faster when the number of elements in each set is small, and list is marginally faster for larger sets. In-place using set is substantially slower than both, followed by set_intersection and hash sets. Is there a faster algorithm, data structure, or trick to achieve this? I can post code snippets if required. Thanks!
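For reference, the in-place approach from the first item could look roughly like the sketch below. This is only one possible reading of that description, not the asker's actual code: the name `intersect_in_place`, the `int` element type, and the by-value collection are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper (not from the question): in-place intersection in a
// std::set that starts as a copy of the first set.
std::set<int> intersect_in_place(const std::vector<std::set<int>>& sets) {
    std::set<int> result;
    if (sets.empty()) return result;
    result = sets.front();  // initial copy of the first set

    for (std::size_t i = 1; i < sets.size() && !result.empty(); ++i) {
        auto it = result.begin();
        auto other = sets[i].begin();
        const auto other_end = sets[i].end();
        while (it != result.end()) {
            // Skip elements of the ith set that are smaller than *it.
            while (other != other_end && *other < *it) ++other;
            if (other != other_end && *other == *it) {
                ++it;                    // present in both: keep it
            } else {
                it = result.erase(it);   // missing from the ith set: remove
            }
        }
    }
    return result;
}
```

Every `erase` on a std::set frees a tree node, which is one plausible reason this variant benchmarks slower than the vector-based ones.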
You might want to try a generalization of std::set_intersection(). The algorithm is to use iterators for all sets:
1. If any iterator has reached the end() of its corresponding set, you are done. Thus, it can be assumed that all iterators are valid.
2. Take the value of the first iterator as the next candidate value x.
3. Move through the list of iterators and std::find_if() the first element at least as big as x.
4. If the value is bigger than x, make it the new candidate value and search again in the sequence of iterators.
5. If all iterators are on the value x, you have found an element of the intersection: record it, increment all iterators, and start over.
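The steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the answerer's own code: the function name `intersect_all`, the `int` element type, and the round-robin visiting order with an `agreed` counter are choices made here to keep the example short.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical helper (name and signature are assumptions): k-way
// intersection that keeps one iterator per set.
std::vector<int> intersect_all(const std::vector<std::set<int>>& sets) {
    std::vector<int> result;
    if (sets.empty()) return result;

    std::vector<std::set<int>::const_iterator> its;
    its.reserve(sets.size());
    for (const auto& s : sets) {
        if (s.empty()) return result;  // an exhausted iterator means we are done
        its.push_back(s.begin());
    }

    const std::size_t k = sets.size();
    int x = *its[0];         // current candidate value
    std::size_t agreed = 0;  // consecutive iterators known to sit on x
    std::size_t i = 0;       // iterator currently being visited
    for (;;) {
        // Move iterator i to the first element at least as big as x.
        its[i] = std::find_if(its[i], sets[i].end(),
                              [x](int v) { return v >= x; });
        if (its[i] == sets[i].end()) return result;
        if (*its[i] == x) {
            ++agreed;
        } else {             // bigger value: it becomes the new candidate
            x = *its[i];
            agreed = 1;
        }
        if (agreed == k) {   // all iterators are on x: record it
            result.push_back(x);
            for (std::size_t j = 0; j < k; ++j) {
                if (++its[j] == sets[j].end()) return result;
            }
            x = *its[0];
            agreed = 0;
            i = 0;
        } else {
            i = (i + 1) % k;
        }
    }
}
```

Called with the sets {1, 2, 3, 4, 5, 8}, {2, 3, 5, 7, 8} and {0, 2, 5, 8, 9}, this yields 2, 5, 8. Each iterator only ever moves forward, so every element is visited at most once.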
Night is a good adviser and I think I may have an idea ;)

Where speed matters, a vector (or perhaps a deque) is such a great structure because it plays very well with memory. As such, I would definitely recommend using vector for our intermediary structures, although care must be taken to only ever insert or delete at an end, to avoid shifting elements.
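To illustrate the point about touching only the end of a vector, here is a minimal sketch that filters a sorted vector against a set by overwriting the kept elements and truncating once at the end. The name `retain_common` is hypothetical and not part of the answer, and the sketch assumes `current` is sorted in ascending order without duplicates.

```cpp
#include <set>
#include <vector>

// Hypothetical helper: keep only the elements of the sorted vector `current`
// that also appear in `other`. Assumes `current` is ascending and unique.
void retain_common(std::vector<int>& current, const std::set<int>& other) {
    auto write = current.begin();  // next slot for a kept element
    auto o = other.begin();
    for (auto read = current.begin();
         read != current.end() && o != other.end(); ++read) {
        // Skip elements of `other` that are smaller than *read.
        while (o != other.end() && *o < *read) ++o;
        if (o != other.end() && *o == *read) {
            *write++ = *read;      // keep: compact towards the front
        }
    }
    // A single erase at the tail; nothing is ever removed from the middle.
    current.erase(write, current.end());
}
```

Because rejected elements are simply overwritten, no element is shifted more than once, and the vector's capacity can be reused across calls.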
So I thought about a rather simple approach:
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>  // std::back_inserter
#include <set>
#include <vector>

// Zero sets and a single set are handled as trivial cases.
// The pointers had better not be null, though!
std::vector<int> intersect(std::vector< std::set<int> const* > const& sets) {
    for (auto s: sets) { assert(s && "I said no null pointer"); }

    std::vector<int> result; // only return this one, for NRVO to kick in

    // 0. Check obvious cases
    if (sets.empty()) { return result; }

    if (sets.size() == 1) {
        result.assign(sets.front()->begin(), sets.front()->end());
        return result;
    }

    // 1. Intersect the first two sets into the result
    std::set_intersection(sets[0]->begin(), sets[0]->end(),
                          sets[1]->begin(), sets[1]->end(),
                          std::back_inserter(result));

    if (sets.size() == 2) { return result; }

    // 2. Intersect each remaining set with result into buffer, then swap them
    //    around so that the "result" is always in result at the end of the loop.
    std::vector<int> buffer; // outside the loop so that we reuse its memory

    for (std::size_t i = 2; i < sets.size(); ++i) {
        buffer.clear();

        std::set_intersection(result.begin(), result.end(),
                              sets[i]->begin(), sets[i]->end(),
                              std::back_inserter(buffer));

        result.swap(buffer); // O(1): only the internal pointers are exchanged
    }

    return result;
}
It seems correct; I obviously cannot guarantee its speed, though.