
Fastest way to remove duplicates from a vector<>

As the title says, I have a few methods in mind to do this, but I don't know which is fastest.

So let's say that we have a vector<int> vals with some values.

1

After my vals are added

sort(vals.begin(), vals.end());
auto last = unique(vals.begin(), vals.end());
vals.erase(last, vals.end());

2

Convert to set after my vals are added:

set<int> s( vals.begin(), vals.end() );
vals.assign( s.begin(), s.end() );

3

When I add my vals, I check if the value is already in my vector:

if( find(vals.begin(), vals.end(), myVal) == vals.end() )
    // not found yet, so add my val

4

Use a set from the start
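In code, that could be sketched roughly like this; the loop over incoming is just a placeholder for however the values actually arrive:

set<int> s;
for (int v : incoming)                   // "incoming" stands in for wherever the values come from
    s.insert(v);                         // a set silently ignores duplicates, so no check is needed
vector<int> vals(s.begin(), s.end());    // only if a vector is still needed afterwards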

Ok, I've got these 4 methods, my questions are:

1. Of 1, 2 and 3, which is the fastest?
2. Is 4 faster than the first 3?
3. For 2, after converting the vector to a set, is it more convenient to use the set to do what I need to do, or should I do the vals.assign( .. ) and continue with my vector?

Question 1: Both 1 and 2 are O(n log n), 3 is O(n^2). Between 1 and 2, it depends on the data.

Question 2: 4 is also O(n log n) and can be better than 1 and 2 if you have lots of duplicates, because it only stores one copy of each. Imagine a million values that are all equal.

Question 3: Well, that really depends on what you need to do.

The only thing that can be said without knowing more is that your alternative number 3 is asymptotically worse than the others.

If you're using C++11 and don't need ordering, you can use std::unordered_set, which is a hash table and can be significantly faster than std::set.
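For instance, a hash-based drop-in for option 2 could be sketched (assuming <unordered_set> is included) as:

std::unordered_set<int> s( vals.begin(), vals.end() );
vals.assign( s.begin(), s.end() );
// the result is in unspecified order; run std::sort afterwards if ordering is needed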

Option 1 is going to beat all the others. The complexity is just O(N log N) and the contiguous memory of vector keeps the constant factors low.

std::set typically suffers a lot from its non-contiguous, per-node allocations. It's not just that accessing those nodes is slow; creating them takes significant time as well.

These methods all have their shortcomings, although (1) is worth looking at.

But take a look at this 5th option: bear in mind that you can access the vector's data buffer using the data() function. Then, since no reallocation will take place (the vector will only ever get smaller), apply the algorithm that you learned at school:

vals.resize(unduplicate(vals.data(), vals.size()));

std::size_t unduplicate(int* arr, std::size_t length) /*Reference: Gang of Four, I think*/
{
    if (length == 0) return 0;
    int *begin = arr, *it, *end = arr + length - 1;
    for (it = arr + 1; arr < end; arr++, it = arr + 1){
        while (it <= end){
            if (*it == *arr){
                *it = *end--;   // overwrite the duplicate with the current last element
            } else {
                ++it;
            }
        }
    }
    return static_cast<std::size_t>(end - begin) + 1;   // number of unique values kept at the front
}

The function returns the number of unique values, so the resize at the end shrinks the vector to just those, if that is what's required. The algorithm is never worse than O(N^2), and is roughly O(N*U) for U unique values, so it shines when duplicates dominate; when most values are distinct, a sort-then-erase approach at O(N log N) will still be faster.
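For instance, a tiny usage sketch with made-up input values:

std::vector<int> vals{ 1, 3, 1, 2, 3 };
vals.resize(unduplicate(vals.data(), vals.size()));
// vals now holds the three distinct values; note the result is not sorted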

Your 4th option might be an idea if you can adopt it. Profile the performance. Otherwise use my algorithm from the 1960s.

I ran into a similar problem recently, and experimented with 1, 2, and 4, as well as with an unordered_set version of 4. It turned out that the best performance came from the latter: 4 with unordered_set in place of set.

BTW, that empirical finding is not too surprising if one considers that both set and sort were a bit of overkill: they guarantee the relative order of unequal elements. For example, the input 4,3,5,2,4,3 would lead to the sorted unique values 2,3,4,5. This is unnecessary if you can live with the unique values in arbitrary order, i.e. 3,4,2,5. When you use unordered_set, it doesn't guarantee the order, only uniqueness, and therefore it doesn't have to perform the additional work of ensuring the order of different elements.
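For reference, that variant could be sketched roughly as follows; expectedCount and input are placeholders, not names from the actual benchmark:

std::unordered_set<int> s;
s.reserve(expectedCount);                        // reserving up front avoids rehashing as values are inserted
for (int v : input)                              // "input" stands in for however the values are produced
    s.insert(v);                                 // average O(1) insert; duplicates are rejected
std::vector<int> result(s.begin(), s.end());     // unique values, in arbitrary order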
