
Merging vectors without extra memory

I came across this code segment, in which two vectors are merged and elements from one vector are favored in case of duplication:

std::vector<String> fields1 = fieldSource1.get();
std::vector<String> fields2 = fieldSource2.get();
// original
fields1.insert(std::end(fields1), std::begin(fields2), std::end(fields2));
std::stable_sort(std::begin(fields1), std::end(fields1));
fields1.erase(std::unique(std::begin(fields1), std::end(fields1)), std::end(fields1));
return fields1;

Given that the Strings are unique within their respective vectors, and that the order of Strings in the output vector is irrelevant, I think I can make this algorithm more efficient.

I would like to avoid the extra memory allocation that comes with std::set_union() and std::set_difference().

(Directly inserting the output of std::set_difference into the original vector is not an option, because iterators are invalidated when the vector resizes.)
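For comparison, the variant being avoided might look roughly like the sketch below. It assumes the elements are std::string (the String type in the question is not shown), and the function name merge_with_temporary is made up for illustration. The temporary vector is the extra allocation in question.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper, not from the original code. Assumes std::string
// elements that are unique within each input vector.
std::vector<std::string> merge_with_temporary(std::vector<std::string> fields1,
                                              std::vector<std::string> fields2)
{
    std::sort(fields1.begin(), fields1.end());
    std::sort(fields2.begin(), fields2.end());

    // Elements of fields2 that are not in fields1 go into a separate vector,
    // because writing them straight into fields1 could reallocate it while
    // std::set_difference is still reading from it.
    std::vector<std::string> only_in_fields2;
    std::set_difference(fields2.begin(), fields2.end(),
                        fields1.begin(), fields1.end(),
                        std::back_inserter(only_in_fields2));

    // Append the new elements, moving them out of the temporary.
    fields1.insert(fields1.end(),
                   std::make_move_iterator(only_in_fields2.begin()),
                   std::make_move_iterator(only_in_fields2.end()));
    return fields1;
}
```

Called with {"c", "a", "b"} and {"c", "d"}, this sketch returns the four distinct strings, with the elements of the first vector sorted at the front.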

I ended up with this, which is std::set_difference with one iterator replaced by an index:

std::sort(std::begin(fields1), std::end(fields1));
std::sort(std::begin(fields2), std::end(fields2));
// Track positions in fields1 by index, because appending may reallocate it
size_t index = 0;
const size_t end = std::size(fields1);
std::remove_copy_if(std::begin(fields2), std::end(fields2), std::back_inserter(fields1),
[&fields1, &index, end](const String& field)->bool{
    // Search only the original, sorted part of fields1
    auto begin = std::begin(fields1);
    auto found = std::lower_bound(begin+index, begin+end, field);
    index = std::distance(begin, found);
    // found may be one past the sorted part, so check before dereferencing
    return found != begin+end && (*found) == field;
});
return fields1;

My question is: can I make this merge operation more efficient? If not, can I make it more readable?
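One possibly more readable formulation of the same idea is sketched below. It is not from the original post: it assumes std::string elements, the name merge_in_place is hypothetical, and it relies on the stated guarantee that the strings are unique within each vector. Only fields1 is sorted, and each element of fields2 is looked up by binary search in the original, sorted part of fields1.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not from the original code. Assumes std::string
// elements that are unique within each input vector.
std::vector<std::string> merge_in_place(std::vector<std::string> fields1,
                                        std::vector<std::string> fields2)
{
    std::sort(fields1.begin(), fields1.end());
    const std::size_t original_size = fields1.size();

    for (std::string& field : fields2)
    {
        // Recompute the iterators on every pass: push_back may reallocate
        // fields1, but its first original_size elements stay sorted.
        const auto first = fields1.begin();
        const auto last = first + static_cast<std::ptrdiff_t>(original_size);
        if (!std::binary_search(first, last, field))
            fields1.push_back(std::move(field));
    }
    return fields1;
}
```

No temporary container is created, although fields1 itself may still reallocate as it grows. The result is not fully sorted, which is acceptable here because the output order is stated to be irrelevant.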

Representing a collection of strings as a vector is inefficient if you want to keep it in a sorted or mergeable state. It is better to use another container such as std::set, which keeps its elements sorted and unique, or std::unordered_set, which offers average constant-time insertion and lookup.
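As a rough illustration of the container-based approach, the sketch below merges two vectors into a std::set. The function name merge_as_set and the use of std::string are assumptions made for the example.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper. Assumes std::string elements.
std::set<std::string> merge_as_set(const std::vector<std::string>& fields1,
                                   const std::vector<std::string>& fields2)
{
    // Elements of fields1 go in first.
    std::set<std::string> merged(fields1.begin(), fields1.end());
    // insert() leaves an existing equal element untouched, so on
    // duplication the element that came from fields1 is kept.
    merged.insert(fields2.begin(), fields2.end());
    return merged;
}
```

If the sources could hand out sets directly, the intermediate vectors would not be needed at all.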

Be aware that any solution that sorts the strings in place will probably fragment memory further, and increase memory pressure a lot more than creating the correct data structure in the first place.

If you must keep them as a vector of strings, you might consider building a hash table of all the strings seen so far, and only inserting strings whose hash has not yet been seen. If you have a great many duplicates, this method may perform better than sorting each list independently. Note that comparing hashes alone treats two different strings with the same hash as duplicates.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

typedef std::size_t hash_type;
typedef std::string value_type;
typedef std::vector< value_type > values_type;
typedef std::hash< value_type > value_hash_type;
typedef std::unordered_set< hash_type > hash_set_type;

// Records one_hash and returns true if it had not been seen before.
bool is_new_hash(hash_set_type &hash_set,
    const hash_type one_hash
    )
{
    // insert() reports whether the value was actually added
    return hash_set.insert(one_hash).second;
}

int main()
{
    values_type str1, str2, dest;
    str1.push_back("c");
    str1.push_back("a");
    str1.push_back("b");

    str2.push_back("c");
    str2.push_back("d");

    hash_set_type hash_set;
    value_hash_type value_hash;

    for (auto &s : str1)
    {
        if (is_new_hash( hash_set, value_hash( s ) ))
            dest.push_back(s);
    }
    for (auto &s : str2)
    {
        if (is_new_hash(hash_set, value_hash(s)))
            dest.push_back(s);
    }
    std::sort(dest.begin(), dest.end());
}
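The version above stores only hash values, so two distinct strings that happen to share a hash would be merged into one. A variant that compares the strings themselves is sketched below. It is an assumption-laden sketch rather than part of the original answer: it requires C++17 for std::string_view, and the name merge_first_seen is hypothetical.

```cpp
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hypothetical helper, requires C++17. The views in `seen` refer to the
// strings owned by str1 and str2, which must outlive this call.
std::vector<std::string> merge_first_seen(const std::vector<std::string>& str1,
                                          const std::vector<std::string>& str2)
{
    std::unordered_set<std::string_view> seen;
    std::vector<std::string> dest;
    seen.reserve(str1.size() + str2.size());
    dest.reserve(str1.size() + str2.size());

    // Copy a string to dest only the first time its contents are seen.
    auto append_new = [&seen, &dest](const std::vector<std::string>& source) {
        for (const std::string& s : source)
            if (seen.insert(std::string_view(s)).second)
                dest.push_back(s);
    };
    append_new(str1);  // str1 goes first, so its elements win on duplication
    append_new(str2);
    return dest;
}
```

The set holds views rather than copies, so the strings are not duplicated in memory. The output keeps first-seen order, and a final std::sort can be added if sorted output is wanted.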
