
Efficient set union and intersection in C++

Given two sets set1 and set2, I need to compute the ratio of the size of their intersection to the size of their union. So far, I have the following code:

double ratio(const set<string>& set1, const set<string>& set2)
{
    if( set1.size() == 0 || set2.size() == 0 )
        return 0;

    set<string>::const_iterator iter;
    set<string>::const_iterator iter2;
    set<string> unionset;

    // compute intersection and union
    int len = 0;
    for (iter = set1.begin(); iter != set1.end(); iter++) 
    {
        unionset.insert(*iter);
        if( set2.count(*iter) )
            len++;
    }
    for (iter = set2.begin(); iter != set2.end(); iter++) 
        unionset.insert(*iter);

    return (double)len / (double)unionset.size();   
}

It seems to be very slow (I'm calling the function about 3M times, always with different sets). The Python counterpart, on the other hand, is much faster:

def ratio(set1, set2):
    if not set1 or not set2:
        return 0
    return len(set1.intersection(set2)) / len(set1.union(set2))

Any ideas on how to improve the C++ version (preferably without using Boost)?

You don't actually need to construct the union set. In Python terms, len(s1.union(s2)) == len(s1) + len(s2) - len(s1.intersection(s2)): the size of the union is the sum of the sizes of s1 and s2, minus the number of elements counted twice, which is the number of elements in the intersection. Thus, you can do

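int len = 0;  // number of elements in the intersection
for (const string &s : set1) {
    len += set2.count(s);  // count() is 0 or 1 for a std::set
}
return static_cast<double>(len) / (set1.size() + set2.size() - len);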
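For reference, the fragment above can be wrapped into a complete function. The following is only a sketch: the name ratio_by_counting is made up for illustration, and iterating over the smaller of the two sets is an extra assumption not stated in the answer (it reduces the number of lookups when the sizes differ a lot).

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper: |A ∩ B| / |A ∪ B| without building the union set.
double ratio_by_counting(const std::set<std::string>& set1,
                         const std::set<std::string>& set2)
{
    if (set1.empty() || set2.empty())
        return 0.0;

    // Assumption: walk the smaller set and probe the larger one,
    // so the number of lookups is min(|A|, |B|).
    const bool first_is_smaller = set1.size() <= set2.size();
    const std::set<std::string>& smaller = first_is_smaller ? set1 : set2;
    const std::set<std::string>& larger  = first_is_smaller ? set2 : set1;

    std::size_t common = 0;
    for (const std::string& s : smaller)
        common += larger.count(s);  // 0 or 1 for std::set

    // |A ∪ B| = |A| + |B| - |A ∩ B|
    const std::size_t union_size = set1.size() + set2.size() - common;
    return static_cast<double>(common) / static_cast<double>(union_size);
}
```

Each lookup in a std::set is logarithmic, so this version costs roughly O(min(n, m) * log(max(n, m))) and allocates nothing.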

It can be done in linear time, without allocating new memory, by walking both sorted sets in step:

double ratio(const std::set<string>& set1, const std::set<string>& set2)
{
    if (set1.empty() || set2.empty()) {
        return 0.;
    }
    std::set<string>::const_iterator iter1 = set1.begin();
    std::set<string>::const_iterator iter2 = set2.begin();
    int union_len = 0;
    int intersection_len = 0;
    while (iter1 != set1.end() && iter2 != set2.end()) 
    {
        ++union_len;
        if (*iter1 < *iter2) {
            ++iter1;
        } else if (*iter2 < *iter1) {
            ++iter2;
        } else { // *iter1 == *iter2
            ++intersection_len;
            ++iter1;
            ++iter2;
        }
    }
    union_len += std::distance(iter1, set1.end());
    union_len += std::distance(iter2, set2.end());
    return static_cast<double>(intersection_len) / union_len;
}
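The same merge-style walk is also available in the standard library as std::set_intersection. The sketch below is an alternative, not the answer's method: the function name ratio_with_set_intersection is hypothetical, and unlike the hand-written loop it copies the common elements into a temporary vector, so it does allocate memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: uses std::set_intersection on the two sorted ranges.
double ratio_with_set_intersection(const std::set<std::string>& set1,
                                   const std::set<std::string>& set2)
{
    if (set1.empty() || set2.empty())
        return 0.0;

    // Both sets use the default std::less ordering, which matches the
    // operator< comparison that std::set_intersection uses by default.
    std::vector<std::string> common;
    std::set_intersection(set1.begin(), set1.end(),
                          set2.begin(), set2.end(),
                          std::back_inserter(common));

    // |A ∪ B| = |A| + |B| - |A ∩ B|
    const std::size_t union_size = set1.size() + set2.size() - common.size();
    return static_cast<double>(common.size()) / static_cast<double>(union_size);
}
```

This stays linear in the combined size of the two sets, but the string copies may matter when the function is called millions of times; the hand-written loop above avoids them.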
