Is gcc std::unordered_map implementation slow? If so - why?

We are developing highly performance-critical software in C++. There we need a concurrent hash map and implemented one. So we wrote a benchmark to figure out how much slower our concurrent hash map is compared with std::unordered_map.

But std::unordered_map seems to be incredibly slow... So this is our micro-benchmark (for the concurrent map we spawned a new thread to make sure that locking does not get optimized away, and note that I never insert 0 because I also benchmark with google::dense_hash_map, which needs a designated empty key):

// excerpt from the benchmark's main(); SIZE and the includes are in the
// full source (pastebin link below)
boost::random::mt19937 rng;
boost::random::uniform_int_distribution<uint64_t> dist(std::numeric_limits<uint64_t>::min(), std::numeric_limits<uint64_t>::max());
std::vector<uint64_t> vec(SIZE);
// fill vec with non-zero random keys (0 is reserved for google::dense_hash_map)
for (int i = 0; i < SIZE; ++i) {
    uint64_t val = 0;
    while (val == 0) {
        val = dist(rng);
    }
    vec[i] = val;
}
std::unordered_map<uint64_t, long double> map; // key type matches the 64-bit keys
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < SIZE; ++i) {
    map[vec[i]] = 0.0;
}
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "inserts: " << elapsed.count() << std::endl;
std::random_shuffle(vec.begin(), vec.end());
begin = std::chrono::high_resolution_clock::now();
long double val;
for (int i = 0; i < SIZE; ++i) {
    val = map[vec[i]];
}
end = std::chrono::high_resolution_clock::now();
elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "get: " << elapsed.count() << std::endl;

(EDIT: the whole source code can be found here: http://pastebin.com/vPqf7eya )

The result for std::unordered_map is:

inserts: 35126
get    : 2959

For google::dense_hash_map:

inserts: 3653
get    : 816

For our hand-rolled concurrent map (which does locking, although the benchmark is single-threaded - the work runs in a separately spawned thread):

inserts: 5213
get    : 2594

If I compile the benchmark program without pthread support and run everything in the main thread, I get the following results for our hand-rolled concurrent map:

inserts: 4441
get    : 1180

I compile with the following command:

g++-4.7 -O3 -DNDEBUG -I/tmp/benchmap/sparsehash-2.0.2/src/ -std=c++11 -pthread main.cc

So especially the inserts into std::unordered_map seem to be extremely expensive - 35 seconds vs 3-5 seconds for the other maps. The lookup time also seems to be quite high.

My question: why is this? I read another question on stackoverflow where someone asks why std::tr1::unordered_map is slower than his own implementation. There the highest-rated answer states that std::tr1::unordered_map needs to implement a more complicated interface. But I cannot see this argument: we use a bucket approach in our concurrent_map, and std::unordered_map uses a bucket approach too (google::dense_hash_map does not, but then std::unordered_map should be at least as fast as our hand-rolled concurrency-safe version?). Apart from that, I cannot see anything in the interface that forces a feature which would make the hash map perform badly...

So my question: is it true that std::unordered_map seems to be very slow? If not: what is wrong? If yes: what is the reason for that?

And my main question: why is inserting a value into a std::unordered_map so terribly expensive (even if we reserve enough space at the beginning, it does not perform much better - so rehashing seems not to be the problem)?
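
For reference, pre-sizing the map - which I tried - would look roughly like this (a sketch, not the exact code I ran; reserve() and rehash() are the standard C++11 member functions):

std::unordered_map<uint64_t, long double> map;
map.reserve(SIZE);   // request room for SIZE elements before inserting
// equivalently: map.rehash(SIZE) allocates at least SIZE buckets up front,
// since the default max_load_factor() is 1.0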

EDIT:

First of all: yes, the presented benchmark is not flawless - this is because we played around a lot with it and it is just a hack (for example, using a uint64 distribution to generate ints would in practice not be a good idea, and excluding 0 in a loop is kind of stupid, etc...).

At the moment most comments explain that I can make the unordered_map faster by preallocating enough space for it. In our application this is just not possible: we are developing a database management system and need a hash map to store some data during a transaction (for example locking information). So this map can hold everything from 1 entry (the user just makes one insert and commits) to billions of entries (if full table scans happen). It is just impossible to preallocate enough space here (and just allocating a lot in the beginning would consume too much memory).

Furthermore, I apologize that I did not state my question clearly enough: I am not really interested in making unordered_map fast (using Google's dense hash map works fine for us), I just don't really understand where this huge performance difference comes from. It cannot be just preallocation (even with enough preallocated memory, the dense map is an order of magnitude faster than unordered_map, and our hand-rolled concurrent map starts with an array of size 64 - so a smaller one than unordered_map).

So what is the reason for this bad performance of std::unordered_map? Or asked differently: could one write an implementation of the std::unordered_map interface which is standard-conforming and (nearly) as fast as Google's dense hash map? Or is there something in the standard that forces the implementer to choose an inefficient way to implement it?

EDIT 2:

By profiling I see that a lot of time is spent on integer divisions. std::unordered_map uses prime numbers for the array size, while the other implementations use powers of two. Why does std::unordered_map use prime numbers? To perform better if the hash is bad? For good hashes, IMHO, it makes no difference.
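
To illustrate what the profiler pointed at, here is a sketch of the two bucket-indexing strategies (not libstdc++'s actual code):

#include <cstddef>
#include <cstdint>

// Prime-sized table (libstdc++ style): mapping a hash to a bucket needs an
// integer division, which costs tens of cycles on typical x86 hardware.
std::size_t bucket_prime(std::uint64_t hash, std::size_t prime_bucket_count) {
    return hash % prime_bucket_count;  // compiles to a div instruction
}

// Power-of-two table (google::dense_hash_map style): the same mapping is a
// single AND, but it only keeps the low bits, so it punishes weak hashes.
std::size_t bucket_pow2(std::uint64_t hash, std::size_t pow2_bucket_count) {
    return hash & (pow2_bucket_count - 1);  // one cheap bitmask
}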

EDIT 3:

These are the numbers for std::map:

inserts: 16462
get    : 16978

Sooooooo: why are inserts into a std::map faster than inserts into a std::unordered_map... I mean WAT? std::map has worse locality (tree vs array), needs to make more allocations (per insert vs per rehash, plus ~1 for each collision) and, most importantly, has a different algorithmic complexity (O(log n) vs O(1))!

I found the reason: it is a problem with gcc-4.7!!

With gcc-4.7:

inserts: 37728
get    : 2985

With gcc-4.6:

inserts: 2531
get    : 1565

So std::unordered_map in gcc-4.7 is broken (or my installation is - one installation is gcc-4.7.0 on Ubuntu, another is gcc 4.7.1 on Debian testing).

I will submit a bug report... Until then: DO NOT use std::unordered_map with gcc 4.7!

I am guessing that you have not properly sized your unordered_map, as Ylisar suggested. When chains grow too long in unordered_map, the g++ implementation will automatically rehash to a larger hash table, and this would be a big drag on performance. If I remember correctly, unordered_map defaults to (smallest prime larger than) 100.
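
One quick way to watch the rehashing happen (a standalone sketch, not part of the question's benchmark) is to print bucket_count() whenever it changes during a run of inserts:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<std::uint64_t, long double> m;
    std::size_t buckets = m.bucket_count();
    std::cout << "start: " << buckets << " buckets\n";
    for (std::uint64_t i = 1; i <= 1000; ++i) {
        m[i] = 0.0L;
        if (m.bucket_count() != buckets) {  // a rehash just happened
            buckets = m.bucket_count();
            std::cout << "after " << i << " inserts: " << buckets << " buckets\n";
        }
    }
}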

I didn't have chrono on my system, so I timed with times().

#include <sys/times.h>  // times(), struct tms
#include <unistd.h>     // sysconf(), _SC_CLK_TCK
#include <iostream>

// Runs the callable t and prints the user+system CPU time in seconds.
template <typename TEST>
void time_test (TEST t, const char *m) {
    struct tms start;
    struct tms finish;
    long ticks_per_second;

    times(&start);
    t();
    times(&finish);
    ticks_per_second = sysconf(_SC_CLK_TCK);
    std::cout << "elapsed: "
              << ((finish.tms_utime - start.tms_utime
                   + finish.tms_stime - start.tms_stime)
                  / (1.0 * ticks_per_second))
              << " " << m << std::endl;
}

I used a SIZE of 10000000 and had to change things a bit for my version of boost. Also note that I pre-sized the hash table to match SIZE/DEPTH, where DEPTH is an estimate of the length of the bucket chain due to hash collisions.

Edit: Howard points out to me in comments that the max load factor for unordered_map is 1. So, the DEPTH controls how many times the code will rehash.

#include <algorithm>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>
#include <boost/random.hpp>
// uses time_test() as defined above

#define SIZE 10000000
#define DEPTH 3
std::vector<uint64_t> vec(SIZE);
boost::mt19937 rng;
boost::uniform_int<uint64_t> dist(std::numeric_limits<uint64_t>::min(),
                                  std::numeric_limits<uint64_t>::max());
// pre-size the table to SIZE/DEPTH buckets; key type matches the 64-bit keys
std::unordered_map<uint64_t, long double> map(SIZE/DEPTH);

void
test_insert () {
    for (int i = 0; i < SIZE; ++i) {
        map[vec[i]] = 0.0;
    }
}

void
test_get () {
    long double val;
    for (int i = 0; i < SIZE; ++i) {
        val = map[vec[i]];
    }
}

int main () {
    // fill vec with non-zero random keys
    for (int i = 0; i < SIZE; ++i) {
        uint64_t val = 0;
        while (val == 0) {
            val = dist(rng);
        }
        vec[i] = val;
    }
    time_test(test_insert, "inserts");
    std::random_shuffle(vec.begin(), vec.end());
    time_test(test_get, "get");  // the get pass must call test_get
}

Edit:

I modified the code so that I could change out DEPTH more easily.

#ifndef DEPTH
#define DEPTH 10000000
#endif

So, by default, the worst size for the hash table is chosen.

elapsed: 7.12 inserts, elapsed: 2.32 get, -DDEPTH=10000000
elapsed: 6.99 inserts, elapsed: 2.58 get, -DDEPTH=1000000
elapsed: 8.94 inserts, elapsed: 2.18 get, -DDEPTH=100000
elapsed: 5.23 inserts, elapsed: 2.41 get, -DDEPTH=10000
elapsed: 5.35 inserts, elapsed: 2.55 get, -DDEPTH=1000
elapsed: 6.29 inserts, elapsed: 2.05 get, -DDEPTH=100
elapsed: 6.76 inserts, elapsed: 2.03 get, -DDEPTH=10
elapsed: 2.86 inserts, elapsed: 2.29 get, -DDEPTH=1

My conclusion is that there is not much significant performance difference for any initial hash table size other than making it equal to the entire expected number of unique insertions. Also, I don't see the order-of-magnitude performance difference that you are observing.

I have run your code on a 64-bit / AMD / 4-core (2.1 GHz) computer and it gave me the following results:

MinGW-W64 4.9.2:

Using std::unordered_map:

inserts: 9280 
get: 3302

Using std::map:

inserts: 23946
get: 24824

VC 2015 with all the optimization flags I know:

Using std::unordered_map:

inserts: 7289
get: 1908

Using std::map:

inserts: 19222 
get: 19711

I have not tested the code using GCC, but I think it may be comparable to the performance of VC, so if that is true, then GCC 4.9's std::unordered_map is still broken.

[EDIT]

So yes, as someone said in the comments, there is no reason to think that the performance of GCC 4.9.x would be comparable to VC performance. When I get the chance I will test the code on GCC.

My answer is just meant to establish some kind of knowledge base for the other answers.
