Fastest way to check whether a value already exists in an STL container

Question

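I have a very large list of memory addresses (about 400,000) and need to check, 400,000 times per second, whether a given address already exists in it.

A code example to illustrate my setup: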

std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries

while (true) {
    // a new list with possible new addresses
    std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries

    // in my own code, these represent a new address list
    for (auto newAddress : newAddresses) {

        // already processed this address, skip it
        if (existingAddresses.find(newAddress) != existingAddresses.end()) {
          continue;
        }

        // we didn't have this address yet, so process it.
        SomeHeavyTask(newAddress);

        // so we don't process it again
        existingAddresses.emplace(newAddress);
    }

    Sleep(1000);
}

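This is the first implementation I came up with, and I think it can be improved a lot.

Next, I came up with a custom indexing strategy, similar to what databases use. The idea is to take part of the value and use it as the key of its own group of addresses. For example, if I use the last two hex digits of the address, I get 16^2 = 256 groups to put addresses into.

So I would end up with a map like this: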

[FF] -> all addresses ending with `FF`
[EF] -> all addresses ending with `EF`
[00] -> all addresses ending with `00`
// etc...

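This way, I only need to do a lookup among roughly 360 entries in the corresponding set, so each of the 400,000 checks per second searches about 360 entries. Much better!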
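The bucketing idea described above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: the class name `BucketedAddressIndex` and its member functions are hypothetical, and it assumes the bucket key is the lowest byte of the address (its last two hex digits).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical helper: 256 buckets keyed by the lowest byte of the address,
// i.e. its last two hex digits.
class BucketedAddressIndex {
public:
    // Returns true if the address was newly inserted,
    // false if it was already present in its bucket.
    bool insertIfAbsent(std::uintptr_t address) {
        return buckets_[bucketOf(address)].insert(address).second;
    }

    // Looks only inside the single bucket the address maps to.
    bool contains(std::uintptr_t address) const {
        const std::set<std::uintptr_t>& bucket = buckets_[bucketOf(address)];
        return bucket.find(address) != bucket.end();
    }

private:
    // Masks out everything except the lowest 8 bits (0x00..0xFF).
    static std::size_t bucketOf(std::uintptr_t address) {
        return static_cast<std::size_t>(address & 0xFF);
    }

    std::array<std::set<std::uintptr_t>, 256> buckets_;
};
```

One caveat with this sketch: if the addresses are aligned (for example to 8 or 16 bytes), their low bits carry little information, so many of the 256 buckets may stay empty and the rest grow larger than expected.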

I'm wondering whether there are other tricks or better ways to do this. My goal is to make this address lookup as fast as possible.

Answer 1

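std::set<uintptr_t> uses a balanced tree, so lookup time is O(log N).

std::unordered_set<uintptr_t>, on the other hand, is hash-based, with O(1) lookup time on average.

Although this is only an asymptotic complexity measure, which means an improvement is not guaranteed because of the constant factors involved, the difference is likely to be noticeable when the sets contain 400,000 elements.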
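As a rough sketch of what the hash-based variant might look like: the function name `processNewAddresses` and the `heavyTask` callback are hypothetical stand-ins (the callback plays the role of `SomeHeavyTask` from the question). Unlike the question's loop, this version inserts the address before processing it, which lets a single `insert` call serve as both the existence check and the bookkeeping.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper: runs heavyTask on every candidate address that is not
// yet in `existing`, records it, and returns how many addresses were processed.
template <typename Task>
std::size_t processNewAddresses(std::unordered_set<std::uintptr_t>& existing,
                                const std::vector<std::uintptr_t>& candidates,
                                Task&& heavyTask) {
    std::size_t processed = 0;
    for (std::uintptr_t address : candidates) {
        // insert() returns a pair; .second is true only when the address
        // was not already present, so no separate find() is needed.
        if (existing.insert(address).second) {
            heavyTask(address);
            ++processed;
        }
    }
    return processed;
}
```

Calling `existing.reserve(400000)` once up front would probably avoid repeated rehashing while the set fills; whether the hash set actually beats the tree should be measured on the real data.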

Answer 2

You can use an algorithm similar to a merge:

std::set<uintptr_t> existingAddresses; // this one contains 400.000 entries

while (true) {
    // a new list with possible new addresses
    std::set<uintptr_t> newAddresses; // also contains about ~400.000 entries
    auto existing_it = existingAddresses.begin();
    auto new_it = newAddresses.begin();

    while (new_it != newAddresses.end() && existing_it != existingAddresses.end()) {
        if (*new_it < *existing_it) {
            // we didn't have this address yet, so process it.
            SomeHeavyTask(*new_it);
            // so we don't process it again
            existingAddresses.insert(existing_it, *new_it);
            ++new_it;
        } else if (*existing_it < *new_it) {
            ++existing_it;
        } else { // Both equal
            ++existing_it;
            ++new_it;
        }
    }
    // existingAddresses is exhausted: every remaining new address is unseen.
    while (new_it != newAddresses.end()) {
        // we didn't have this address yet, so process it.
        SomeHeavyTask(*new_it);
        // so we don't process it again
        existingAddresses.insert(existingAddresses.end(), *new_it);
        ++new_it;
    }
    Sleep(1000);
}

Now the complexity is linear: O(N + M) instead of O(N log M) (with N new addresses and M existing ones).
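The detection half of the same linear walk can also be expressed with the standard `std::set_difference` algorithm, which performs at most 2 * (N + M) - 1 comparisons on two sorted ranges. The sketch below is an alternative formulation, not the answer's code; the function name `findUnseen` is hypothetical, and it only finds the unseen addresses without inserting them back.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical helper: returns the addresses that are in `incoming` but not
// in `existing`, in ascending order. Both sets iterate in sorted order, which
// is what std::set_difference requires.
std::vector<std::uintptr_t> findUnseen(const std::set<std::uintptr_t>& existing,
                                       const std::set<std::uintptr_t>& incoming) {
    std::vector<std::uintptr_t> unseen;
    std::set_difference(incoming.begin(), incoming.end(),
                        existing.begin(), existing.end(),
                        std::back_inserter(unseen));
    return unseen;
}
```

The caller would then run the heavy task on each returned address and add it to the existing set; that insertion step has its own cost, which this sketch does not cover.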

Answer 1
Score 11, accepted, 2017-02-27 11:58:10

Answer 2
Score 1, 2017-02-27 12:45:05
