
Why is inserting sorted keys into std::set so much faster than inserting shuffled keys?

I was surprised to find that inserting sorted keys into std::set is much faster than inserting shuffled keys. This is somewhat counterintuitive: a red-black tree (I verified that std::set is implemented as a red-black tree on my system), being a self-balancing binary search tree, needs to do a lot of rebalancing operations to insert a sequence of sorted keys, so inserting sorted keys should take more time than inserting shuffled keys.

But in fact, inserting sorted keys can be up to 15 times faster than inserting shuffled keys. Here is my test code and some results:

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <set>
#include <vector>
using namespace std;

int64_t insertion_time(const vector<int> &keys) {
    auto start = chrono::system_clock::now();
    set<int>(keys.begin(), keys.end());
    auto stop = chrono::system_clock::now();
    auto elapsed = chrono::duration_cast<chrono::milliseconds>(stop - start);
    return elapsed.count();
}

int main() {
    size_t test_size;
    cout << "test size: ";
    cin >> test_size;
    vector<int> keys(test_size);
    for (size_t i = 0; i < test_size; ++i) {
        keys[i] = i;
    }
    
    // Whether the shuffled case or the sorted case ran first was irrelevant; results were similar.
    auto rng = std::default_random_engine {};
    shuffle(keys.begin(), keys.end(), rng);
    cout << "shuffled: " << insertion_time(keys) << endl;

    sort(keys.begin(), keys.end());
    cout << "sorted: " << insertion_time(keys) << endl;

    return 0;
}
// i7 8700, 32 GB RAM, WIN10 2004, g++ -O3 main.cpp
// An interesting observation: the difference grows as test_size gets larger.
// Similar results showed up for my handwritten red-black tree and on other
// machines (other compilers, operating systems, etc.).

C:\Users\Leon\Desktop\testSetInsertion>a
test size: 1000000
shuffled: 585
sorted: 96

C:\Users\Leon\Desktop\testSetInsertion>a
test size: 3000000
shuffled: 2480
sorted: 296

C:\Users\Leon\Desktop\testSetInsertion>a
test size: 5000000
shuffled: 4805
sorted: 484

C:\Users\Leon\Desktop\testSetInsertion>a
test size: 10000000
shuffled: 11537
sorted: 977

C:\Users\Leon\Desktop\testSetInsertion>a
test size: 30000000
shuffled: 46239
sorted: 3076

Can anyone explain this? My guess is that it has something to do with cache locality, since when inserting sorted keys, rebalancing typically involves the most recently inserted nodes. But that is just a guess, and I know very little about cache locality.

If you look at https://en.cppreference.com/w/cpp/container/set/set

you can see:

Complexity
[..]
2) N log(N), where N = std::distance(first, last) in general; linear in N if the range is already sorted by value_comp().

When inserting in a loop, we can pass end() as a hint; insertion is amortized constant time when the hint is correct.
