
When does a map become better than two vectors?

A map does a binary-search-style lookup over its elements, which has logarithmic complexity. This means that for a small enough collection of objects, a map will perform worse than two vectors searched linearly.

How large should the object (key) pool be for a map to start performing better than two vectors?

Edit: A more generalized version of the question: how large should the object pool be for binary search to perform better than linear search?

I'm using strings as keys and the values are pointers, but my particular use case probably shouldn't matter. I'm more curious to understand how to use the two tools properly.
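For concreteness, here is a minimal sketch of the two approaches being compared: two parallel vectors searched linearly versus a std::map. This is my own illustration rather than code from the question, and Widget is just a hypothetical stand-in for whatever type the value pointers point to.

#include <algorithm>
#include <map>
#include <string>
#include <vector>

struct Widget;  // hypothetical value type; only pointers to it are used here

// Two parallel vectors: keys[i] corresponds to values[i]. Lookup is a linear scan.
Widget* find_linear(const std::vector<std::string>& keys,
                    const std::vector<Widget*>& values,
                    const std::string& key) {
    std::vector<std::string>::const_iterator pos =
        std::find(keys.begin(), keys.end(), key);           // O(n)
    return pos == keys.end() ? 0 : values[pos - keys.begin()];
}

// The same lookup through a map: a logarithmic tree search.
Widget* find_map(const std::map<std::string, Widget*>& m,
                 const std::string& key) {
    std::map<std::string, Widget*>::const_iterator pos = m.find(key);   // O(log n)
    return pos == m.end() ? 0 : pos->second;
}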

If you'll forgive my saying so, most of the answers sound to me like various ways of saying: "I don't know", without actually admitting that they don't know. While I generally agree with the advice they've given, none of them seems to have attempted to directly address the question you asked: what is the break-even point.

To be fair, when I read the question, I didn't really know either. It's one of those things where we all know the basics: for a small enough collection, a linear search will probably be faster, and for a large enough collection, a binary search will undoubtedly be faster. I, however, have never really had much reason to investigate where the break-even point would really be. Your question got me curious, though, so I decided to write a bit of code to get at least some idea.

This code is definitely a very quick hack (lots of duplication, only currently supports one type of key, etc.) but at least it might give some idea of what to expect:

#include <set>
#include <vector>
#include <string>
#include <cstdlib>      // rand()
#include <time.h>       // clock()
#include <iostream>
#include <algorithm>

int main() {   

    static const int reps = 100000;

    std::vector<int> data_vector;
    std::set<int> data_set;
    std::vector<int> search_keys;

    for (int size=10; size<100; size += 10) {
        data_vector.clear();
        data_set.clear();
        search_keys.clear();

        int num_keys = size / 10;   
        for (int i=0; i<size; i++) {
            int key = rand();

            if (i % num_keys == 0)
                search_keys.push_back(key);

            data_set.insert(key);
            data_vector.push_back(key);
        }

        // Search for a few keys that probably aren't present.
        for (int i=0; i<10; i++)
            search_keys.push_back(rand());

        long total_linear =0, total_binary = 0;

        clock_t start_linear = clock();
        for (int k=0; k<reps; k++) {
            for (int j=0; j<search_keys.size(); j++) {
                std::vector<int>::iterator pos = std::find(data_vector.begin(), data_vector.end(), search_keys[j]);
                if (pos != data_vector.end())
                    total_linear += *pos;
            }
        }
        clock_t stop_linear = clock();                      

        clock_t start_binary = clock();
        for (int k=0; k<reps; k++) {
            for (int j=0; j<search_keys.size(); j++) {
                std::set<int>::iterator pos = data_set.find(search_keys[j]);
                if (pos != data_set.end())
                    total_binary += *pos;
            }
        }
        clock_t stop_binary = clock();

        std::cout << "\nignore: " << total_linear << " ignore also: " << total_binary << "\n";

        std::cout << "For size = " << size << "\n";
        std::cout << "\tLinear search time = " << stop_linear - start_linear << "\n";
        std::cout << "\tBinary search time = " << stop_binary - start_binary << "\n";
    }
    return 0;
}

Here are the results I get running this on my machine:

ignore: 669830816 ignore also: 669830816
For size = 10
        Linear search time = 37
        Binary search time = 45

ignore: 966398112 ignore also: 966398112
For size = 20
        Linear search time = 60
        Binary search time = 47

ignore: 389263520 ignore also: 389263520
For size = 30
        Linear search time = 83
        Binary search time = 63

ignore: -1561901888 ignore also: -1561901888
For size = 40
        Linear search time = 106
        Binary search time = 65

ignore: -1741869184 ignore also: -1741869184
For size = 50
        Linear search time = 127
        Binary search time = 69

ignore: 1130798112 ignore also: 1130798112
For size = 60
        Linear search time = 155
        Binary search time = 75

ignore: -1949669184 ignore also: -1949669184
For size = 70
        Linear search time = 173
        Binary search time = 83

ignore: -1991069184 ignore also: -1991069184
For size = 80
        Linear search time = 195
        Binary search time = 90

ignore: 1750998112 ignore also: 1750998112
For size = 90
        Linear search time = 217
        Binary search time = 79

Obviously that's not the only possible test (or even close to the best one possible), but it seems to me that even a little hard data is better than none at all.
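One caveat about the test itself: it compares linear search against std::set::find, which walks a balanced tree rather than doing a binary search over contiguous storage. If you want binary search in the literal sense, the usual contender is a sorted vector with std::lower_bound; a minimal sketch of that lookup (my addition, not part of the test above) would look like this:

#include <algorithm>
#include <vector>

// Binary search on a sorted vector. Returns true if key is present.
bool contains_sorted(const std::vector<int>& sorted_data, int key) {
    std::vector<int>::const_iterator pos =
        std::lower_bound(sorted_data.begin(), sorted_data.end(), key);  // first element >= key
    return pos != sorted_data.end() && *pos == key;
}

Timing that variant may shift the break-even point somewhat, since the sorted vector enjoys the same cache-friendly layout as the linear scan.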

Edit: I would note for the record that I see no reason that code using two vectors (or a vector of pairs) can't be just as clean as code using a set or map. Obviously you'd want to put the code for it into a small class of its own, but I see no reason at all that it couldn't present precisely the same interface to the outside world that map does. In fact, I'd probably just call it a "tiny_map" (or something on that order).
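For illustration only, here is roughly what such a class might look like: a vector of pairs behind a map-like facade, with linear search inside. The name and details are my own sketch under that assumption, not a drop-in std::map replacement.

#include <cstddef>
#include <utility>
#include <vector>

template <class Key, class Value>
class tiny_map {
public:
    typedef std::pair<Key, Value> value_type;
    typedef typename std::vector<value_type>::iterator iterator;

    iterator begin() { return data_.begin(); }
    iterator end()   { return data_.end(); }

    // Linear search, exactly like the two-vector approach.
    iterator find(const Key& key) {
        for (iterator it = data_.begin(); it != data_.end(); ++it)
            if (it->first == key)
                return it;
        return data_.end();
    }

    // Map-style access: returns the existing value, or default-constructs one.
    Value& operator[](const Key& key) {
        iterator it = find(key);
        if (it != data_.end())
            return it->second;
        data_.push_back(value_type(key, Value()));
        return data_.back().second;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<value_type> data_;
};

Client code can then use it much as it would a std::map (for example, lookup["name"] = ptr;); only the implementation differs.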

One of the basic points of OO programming (and it remains the case in generic programming, to at least some degree) is to separate the interface from the implementation. In this case, you're talking about purely an implementation detail that need not affect the interface at all. In fact, if I were writing a standard library, I'd be tempted to incorporate this as a "small map optimization" analogous to the common small string optimization. Basically, just allocate an array of 10 (or so) objects of value_type directly in the map object itself, and use them when/if the map is small, then move the data to a tree iff it grows large enough to justify it. The only real question is whether people use tiny maps often enough to justify the work.

The code for map will be much cleaner than the code for two vectors; that should be your primary concern. Only once you've determined that the performance of map is a problem in your own application should you worry about the difference, and at that point you should benchmark it yourself.

A map will never equal a sorted vector's search speed, because the vector's memory is better packed and memory access is more straightforward.

However, that's only half the story. As soon as you start modifying the container, node-based containers (which do not need to move their elements around to accommodate insertions in the middle) gain a theoretical advantage. As with searching, the better cache locality still gives the vector an edge for a small number of elements.
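To make the modification cost concrete (my own sketch, not code from this answer): inserting into a sorted vector has to shift every element after the insertion point, while a map only links a new node into the tree.

#include <algorithm>
#include <map>
#include <vector>

// Keep a vector sorted while inserting: elements after pos are moved, O(n).
void insert_sorted(std::vector<int>& sorted_data, int key) {
    std::vector<int>::iterator pos =
        std::lower_bound(sorted_data.begin(), sorted_data.end(), key);
    sorted_data.insert(pos, key);
}

// Insert into a map: a tree node is allocated and linked, O(log n); nothing moves.
void insert_map(std::map<int, int>& m, int key) {
    m[key] = 0;
}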

Exact measurements are hard to provide and depend on specific optimizations, usage context and machine, so I would advise simply going by functionality. A map has a natural interface for searching by key, so using it will make your life easier.

If you do find yourself measuring and needing to squeeze performance out of this use case, then you probably need a more specialized data structure. Look up B-trees.
