
How to improve std::set_intersection performance in C++?

While experimenting with std::set in C++ and set() in Python, I ran into a performance issue that I can't explain: set intersection in C++ is at least 3 times slower than in Python.

So could anybody point me at an optimization that could be applied to the C++ code, and/or explain how Python does this so much faster?

I expect both of them to use a similar O(n) algorithm, since the sets are ordered. But Python probably applies some optimization that gives it a smaller constant factor.

set_bench.cc

#include <iostream>
#include <set>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>

void elapsed(std::function<void()> f, const std::string& s)
{
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << s << " " << elapsed.count() << " seconds" << std::endl;
}

template <typename T>
void fill_set(std::set<T>& s, T start, T end, T step)
{
    for (T i = start; i < end; i += step) {
        s.emplace(i);
    }
}

template <typename T>
void intersect(const std::set<T>& s1, const std::set<T>& s2, std::set<T>& result)
{
    std::set_intersection(s1.begin(), s1.end(),
                            s2.begin(), s2.end(),
                            std::inserter(result, result.begin()));
}

int main()
{
    std::set<int64_t> s1;
    std::set<int64_t> s2;
    std::set<int64_t> s3;

    elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000*1000*100, 13), "fill s1 took");
    elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000*1000*100, 7), "fill s2 took");

    std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;

    elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");

    std::cout << "s3 length = " << s3.size() << std::endl;

    // sleep to let check memory consumption
    // while (true) std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}

set_bench.py

#!/usr/bin/env python3

import time

def elapsed(f, s):
    start = time.monotonic()
    f()
    elapsed = time.monotonic() - start
    print(f'{s} {elapsed} seconds')

def fill_set(s, start, end, step=1):
    for i in range(start, end, step):
        s.add(i)

def intersect(s1, s2, result):
    result.update(s1 & s2)

s1 = set()
s2 = set()

elapsed(lambda : fill_set(s1, 8, 1000*1000*100, 13), 'fill s1 took')
elapsed(lambda : fill_set(s2, 0, 1000*1000*100, 7), 'fill s2 took')

print(f's1 length = {len(s1)}, s2 length = {len(s2)}')


s3 = set()

elapsed(lambda: intersect(s1, s2, s3), 'intersect s1 and s2 took')

print(f's3 length = {len(s3)}')

# sleep to let check memory consumption
# while True: time.sleep(1)

Here are the results of running these programs in the following environment:

  • clang version 7.0.1
  • gcc 8.2.0
  • Python 3.7.2
  • i7-7700 CPU @ 3.60GHz
$ clang -lstdc++ -O0 set_bench.cc -o set_bench && ./set_bench
fill s1 took 5.38646 seconds
fill s2 took 10.5762 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 1.48387 seconds
s3 length = 1098901
$ clang -lstdc++ -O1 set_bench.cc -o set_bench && ./set_bench
fill s1 took 3.31435 seconds
fill s2 took 6.41415 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 1.01276 seconds
s3 length = 1098901
$ clang -lstdc++ -O2 set_bench.cc -o set_bench && ./set_bench
fill s1 took 1.90269 seconds
fill s2 took 3.85651 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.512727 seconds
s3 length = 1098901
$ clang -lstdc++ -O3 set_bench.cc -o set_bench && ./set_bench
fill s1 took 1.92473 seconds
fill s2 took 3.72621 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.523683 seconds
s3 length = 1098901
$ gcc -lstdc++ -O3 set_bench.cc -o set_bench && time ./set_bench
fill s1 took 1.72481 seconds
fill s2 took 3.3846 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.516702 seconds
s3 length = 1098901
$ python3.7 ./set_bench.py 
fill s1 took 0.9404696229612455 seconds
fill s2 took 1.082577683031559 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.17995300807524472 seconds
s3 length = 1098901

As you can see, the results are equal, so I assume both programs do the same calculations.

By the way, RSS for the C++ program is 1084896 kB and for Python it is 1590400 kB.

There are two questions in this post:

Q: How to improve std::set_intersection performance in C++?

Use sorted std::vectors instead of sets; that's much more cache-friendly. Since the intersection is done sequentially in a single pass, it will be as fast as it can get. On my system I got a 0.04 s run time. Stop here if this is all you needed.
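
For illustration, here is a minimal sketch of that single-pass approach on already-sorted vectors (it needs <vector>, <algorithm> and <iterator>; the reserve() call and std::back_inserter are my own choices, not part of the answer above):

std::vector<int64_t> v1, v2, v3;             // assume v1 and v2 are filled and sorted
v3.reserve(std::min(v1.size(), v2.size()));  // upper bound on the intersection size
std::set_intersection(v1.begin(), v1.end(),
                      v2.begin(), v2.end(),
                      std::back_inserter(v3)); // single forward pass over both inputs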

Q: ... how [does] Python do this so much faster?

Or in other words, "why is Python's set faster than the C++ set?". I'll focus on this question for the rest of my post.

First of all, Python's set is a hash table, while std::set is a binary tree. So use std::unordered_set to compare apples to apples (we reject the binary tree at this point based on its O(log N) lookup complexity).

Note also that std::set_intersection is simply a two-pointer algorithm; it iterates over two sorted sets, keeping only matching values (a sketch of this scheme follows the list below). Apart from its name, it has nothing in common with Python's set intersection, which by itself is just a simple loop:

  • Iterate over the smaller hashtable
  • For each element, if it exists in the other hashtable, add it to the result
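
To make the contrast concrete, here is a simplified sketch of the two-pointer scheme that std::set_intersection applies to sorted ranges (an illustration of the idea, not the actual library implementation):

// Simplified sketch of the two-pointer merge performed by std::set_intersection.
template <class It1, class It2, class Out>
Out intersect_sorted(It1 f1, It1 l1, It2 f2, It2 l2, Out out)
{
    while (f1 != l1 && f2 != l2) {
        if (*f1 < *f2)            ++f1;       // advance the side holding the smaller value
        else if (*f2 < *f1)       ++f2;
        else { *out++ = *f1; ++f1; ++f2; }    // equal: emit once, advance both
    }
    return out;
}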

So we can't use std::set_intersection on unsorted data; instead, we need to implement the loop ourselves:

    for (auto& v : set1) {
        if (set2.find(v) != set2.end()) {
            result.insert(v);
        }
    }
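
As an aside, to fully mirror the first bullet above, the loop would iterate over the smaller of the two sets; a minimal sketch of that refinement (variable names are mine):

    const auto& small = set1.size() <= set2.size() ? set1 : set2;
    const auto& large = set1.size() <= set2.size() ? set2 : set1;
    for (auto& v : small) {                    // O(min(N1, N2)) lookups instead of O(N1)
        if (large.find(v) != large.end()) {
            result.insert(v);
        }
    }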

Nothing fancy here. Unfortunately, though, a straightforward application of this algorithm on std::unordered_set is still slower by a factor of 3. How can that be?

  1. We observe that the input data set is > 100 MB in size. That won't fit in the 8 MB cache of an i7-7700, which means the more work you can fit within that 8 MB boundary, the faster your program will perform.

  2. Python uses a special form of "dense hash table", similar to the PHP hash table (generally a class of open-addressing hash tables), whereas the C++ std::unordered_set is typically a naïve, vector-of-lists hash table. The dense structure is much more cache-friendly, and thus faster. For implementation details see dictobject.c and setobject.c.

  3. The built-in C++ std::hash<long> is too complex for the already-unique input data set you're generating. Python, on the other hand, uses an identity (no-op) hashing function for integers up to 2^30 (see long_hash). Collisions are amortised by the LCG built into its hashtable implementation. You can't match that with C++ standard library features; an identity hash here would unfortunately again result in a too-sparse hashtable.

  4. Python uses a custom memory allocator, pymalloc, which is similar to jemalloc and optimized for data locality. It generally outperforms the built-in Linux tcmalloc, which is what a C++ program would normally use.

With that knowledge we can contrive a similarly performing C++ version to demonstrate technical feasibility:

#include <iostream>
#include <unordered_set>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>
#include <tuple>
#include <string>

using namespace std::chrono_literals;

void elapsed(std::function<void()> f, const std::string& s)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    std::cout << s << " " << (end - start) / 1.0s << " seconds" << std::endl;
}

template <typename T>
struct myhash {
    size_t operator()(T x) const {
        return x / 5; // cheating to improve data locality
    }
};

template <typename T>
using myset = std::unordered_set<T, myhash<T>>;

template <typename T>
void fill_set(myset<T>& s, T start, T end, T step)
{
    s.reserve((end - start) / step + 1);
    for (T i = start; i < end; i += step) {
        s.emplace(i);
    }
}

template <typename T>
void intersect(const myset<T>& s1, const myset<T>& s2, myset<T>& result)
{
    result.reserve(s1.size() / 4); // cheating to compete with a better memory allocator
    for (auto& v : s1)
    {
        if (s2.find(v) != s2.end())
            result.insert(v);
    }
}

int main()
{
    myset<int64_t> s1;
    myset<int64_t> s2;
    myset<int64_t> s3;

    elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000 * 1000 * 100, 13), "fill s1 took");
    elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000 * 1000 * 100, 7), "fill s2 took");

    std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;

    elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");

    std::cout << "s3 length = " << s3.size() << std::endl;
}

With this code I got 0.28 s run times in both the C++ and Python versions.

Now, if we want to beat Python's set performance, we can remove all the cheats and use Google's dense_hash_set, which implements open addressing with quadratic probing, as a drop-in replacement (it just needs a call to set_empty_key(0)).
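
A minimal sketch of that swap, assuming the sparsehash headers are installed (the identity_hash functor below is my own illustration of the "no-op hashing function" mentioned next):

#include <sparsehash/dense_hash_set>   // https://github.com/sparsehash/sparsehash
#include <cstddef>
#include <cstdint>

// No-op (identity) hash, mirroring what CPython does for small integers.
struct identity_hash {
    std::size_t operator()(int64_t x) const { return static_cast<std::size_t>(x); }
};

template <typename T>
using myset = google::dense_hash_set<T, identity_hash>;

int main()
{
    myset<int64_t> s;
    // A designated "empty" key must be set before the first insert; that key
    // itself can then never be stored (presumably why s2 comes out one element
    // shorter in the output below).
    s.set_empty_key(0);
    s.insert(8);
}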

With google::dense_hash_set and a no-op hashing function, we get:

fill s1 took 0.321397 seconds
fill s2 took 0.529518 seconds
s1 length = 7692308, s2 length = 14285714
intersect s1 and s2 took 0.0974416 seconds
s3 length = 1098901

Or 2.8 times faster than Python, while keeping the hash set functionality!


P.S. One might ask: why does the C++ standard library implement such a slow hash table? The no-free-lunch theorem applies here too: a probing-based solution is not always fast; being an opportunistic solution, it sometimes suffers from "clumping" (endlessly probing into occupied space), and when that happens performance drops exponentially. The idea behind the standard library implementation was to guarantee predictable performance for all possible inputs. Unfortunately, the caching effect on modern hardware is too great to be neglected, as Chandler Carruth explains in his talk.

Using a sorted vector will far outperform set on this benchmark:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>

void elapsed(std::function<void()> f, const std::string& s)
{
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << s << " " << elapsed.count() << " seconds" << std::endl;
}

template <typename T>
void fill_set(std::vector<T>& s, T start, T end, T step)
{
    for (T i = start; i < end; i += step) {
        s.emplace_back(i);
    }
    std::sort(s.begin(), s.end());
}

template <typename T>
void intersect(const std::vector<T>& s1, const std::vector<T>& s2, std::vector<T>& result)
{
    std::set_intersection(s1.begin(), s1.end(),
                            s2.begin(), s2.end(),
                            std::inserter(result, result.begin()));
}

int main()
{
    std::vector<int64_t> s1;
    std::vector<int64_t> s2;
    std::vector<int64_t> s3;

    elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000*1000*100, 13), "fill s1 took");
    elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000*1000*100, 7), "fill s2 took");

    std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;

    elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");

    std::cout << "s3 length = " << s3.size() << std::endl;

    // sleep to let check memory consumption
    // while (true) std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}

For me (clang/libc++ -O3) this took the results from:

fill s1 took 2.01944 seconds
fill s2 took 3.98959 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 1.55453 seconds
s3 length = 1098901

to:

fill s1 took 0.143026 seconds
fill s2 took 0.20209 seconds
s1 length = 7692308, s2 length = 14285715
intersect s1 and s2 took 0.0548819 seconds
s3 length = 1098901

The reason for this performance difference is that the vector version does far fewer allocations.
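
If the element counts are known up front, as they are in this benchmark, reserving the vectors removes even those reallocations; a small sketch, with the counts derived from the benchmark's start/end/step values (this is my own addition, not something the answer above measured):

const std::size_t n1 = (100000000 - 8 + 12) / 13;   // 7692308 elements in s1
const std::size_t n2 = (100000000 - 0 + 6) / 7;     // 14285715 elements in s2

std::vector<int64_t> s1, s2, s3;
s1.reserve(n1);
s2.reserve(n2);
s3.reserve(std::min(n1, n2));                       // upper bound on the intersection size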

You are not comparing like with like.

Python sets are unordered (hash) sets. std::set<> is an ordered set (a binary tree).

From the Python docs:

5.4. Sets: Python also includes a data type for sets. A set is an unordered collection with no duplicate elements. Basic uses include membership testing and eliminating duplicate entries. Set objects also support mathematical operations like union, intersection, difference, and symmetric difference.

Refactoring to compare like with like:

#include <iostream>
#include <unordered_set>
#include <algorithm>
#include <iterator>
#include <chrono>
#include <functional>
#include <thread>
#include <tuple>

void elapsed(std::function<void()> f, const std::string& s)
{
    auto start = std::chrono::steady_clock::now();
    f();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::cout << s << " " << elapsed.count() << " seconds" << std::endl;
}

template <typename T>
void fill_set(std::unordered_set<T>& s, T start, T end, T step)
{
    for (T i = start; i < end; i += step) {
        s.emplace(i);
    }
}

template <typename T>
void intersect(const std::unordered_set<T>& s1, const std::unordered_set<T>& s2, std::unordered_set<T>& result)
{
    auto ordered_refs = [&]()
    {
        if (s1.size() <= s2.size())
            return std::tie(s1, s2);
        else
            return std::tie(s2, s1);
    };

    auto lr = ordered_refs();
    auto& l = std::get<0>(lr);
    auto& r = std::get<1>(lr);
    result.reserve(l.size());

    for (auto& v : l)
    {
        if (auto i = r.find(v) ; i != r.end())
            result.insert(v);
    }
}

int main()
{
    std::unordered_set<int64_t> s1;
    std::unordered_set<int64_t> s2;
    std::unordered_set<int64_t> s3;

    elapsed(std::bind(fill_set<int64_t>, std::ref(s1), 8, 1000*1000*100, 13), "fill s1 took");
    elapsed(std::bind(fill_set<int64_t>, std::ref(s2), 0, 1000*1000*100, 7), "fill s2 took");

    std::cout << "s1 length = " << s1.size() << ", s2 length = " << s2.size() << std::endl;

    elapsed(std::bind(intersect<int64_t>, std::ref(s1), std::ref(s2), std::ref(s3)), "intersect s1 and s2 took");

    std::cout << "s3 length = " << s3.size() << std::endl;

    // sleep to let check memory consumption
    // while (true) std::this_thread::sleep_for(std::chrono::milliseconds(1000));
}

Performance will depend on your kit.

I suspect you can increase performance vastly with a custom allocator. The default one is thread-safe, etc.
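
As one possible direction (my own illustration, not part of the answer above), C++17's polymorphic allocators make it easy to try a non-thread-safe, bump-pointer allocation strategy:

#include <cstdint>
#include <memory_resource>
#include <unordered_set>

int main()
{
    // Monotonic resource: cheap pointer-bump allocation, everything released at once.
    std::pmr::monotonic_buffer_resource pool;
    std::pmr::unordered_set<int64_t> s1(&pool);

    for (int64_t i = 8; i < 1000; i += 13)   // tiny range, just to show the wiring
        s1.insert(i);
}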

Having said this, on my machine I only saw a 20% speedup with the unordered version. I'd hazard a guess that the Python intersect code has been hand-optimised.

For reference, the Python source code is here: https://github.com/python/cpython/blob/master/Objects/setobject.c
