
What's the fastest way to search for duplicates in an array without sorting it?

I have an array 1 2 2 3 4. For each element, I want to count only the duplicates that come after its index. So the first 2 has 1 duplicate and the second 2 has 0. How can I achieve this?

Put elements that you see into a hash-based map.

Starting from the back of your collection, walk backward and add each item to the hash map. If the element you are about to add is not in the map yet, set its duplicate count to zero and put 1 into the map for that element. If a count is already there, the element's duplicate count is whatever is in the map: store that number as the duplicate count, then increment the value in the map.

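#include <iostream>
#include <unordered_map>
#include <vector>
using namespace std;

int main() {
    vector<int> data({1, 2, 2, 3, 4});
    unordered_map<int,int> count;  // value -> occurrences seen so far, i.e. to the right of i
    vector<int> res(data.size(), 0);
    // Walk backward: a missing key starts at 0, so res[i] gets the old count.
    for (int i = static_cast<int>(data.size()) - 1 ; i >= 0 ; i--) {
        res[i] = count[data[i]]++;
    }
    for (size_t i = 0 ; i != res.size() ; i++) {
        cout << data[i] << " - " << res[i] << endl;
    }
}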

Demo on ideone.

If n is the size of the array and i is the index of an element, then for each element you need to scan the n - i - 1 elements that follow it. Summed over all i from 0 to n - 1, that gives n * ( n - 1 ) / 2 comparisons of elements in total.

You can use the standard algorithm std::count.

For example

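#include <algorithm>
#include <cstddef>
#include <iostream>

int main()
{
   const std::size_t N = 5;
   int a[N] = { 1, 2, 2, 3, 4 };

   for ( int *first = a; first != a + N; ++first )
   {
      // Count *first in [first, a + N), then subtract the element itself.
      std::cout << *first << '\t' << std::count( first, a + N, *first ) - 1 << std::endl;
   }
}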

Or

for ( int *first = a; first != a + N; ++first )
{
   std::cout << *first << '\t' << std::count( first + 1, a + N, *first ) << std::endl;
} 

The same can also be written as

for ( auto *first = std::begin( a ); first != std::end( a ); ++first )
{
   std::cout << *first << '\t' << std::count( first, std::end( a ), *first ) - 1 << std::endl;
} 

or as

for ( auto *first = std::begin( a ); first != std::end( a ); ++first )
{
   std::cout << *first << '\t' << std::count( std::next( first ), std::end( a ), *first ) << std::endl;
} 

I don't know if this would be the fastest approach, but my suggestion would be to:

  • Make a secondary array with the same number of elements, initialized to 0
  • Check for duplicates of the last element:
    • mark the second-from-last occurrence with 1,
    • then the third-from-last with 2,
    • and so on...
  • Do the same for each element from the last to the first, skipping any element whose duplicate mark is already non-zero

Like this in C:

#include <stdio.h>
#define Length 10

int main( ) {

    int SomeNumbers[Length] = { 1, 2, 2, 3, 4, 5, 20, 9, 2, 3 };
    int DupCount[Length] = { 0 };

    for ( int i = Length - 1; i >= 0; i-- ) {
        if ( DupCount[i] == 0 ) {
            int dup = 0;
            for ( int j = i - 1; j >= 0; j-- )
                if ( SomeNumbers[i] == SomeNumbers[j] )
                    DupCount[j] = ++dup;
        }
    }

    for ( int i = 0; i < Length; i++ ) printf( "%d ", DupCount[i] );

    getchar( );
    return 0;

}

The most efficient approach in terms of speed would typically be to use a frequency table. Normally, this is a structure that maps a value to the number of times it occurs. In this case, you could map each value to a list/array of indices instead (i.e. the index of each place where the value occurred).

The algorithm would go through each element and add it to the table. If a duplicate is found, its index gets appended to the list/array of indices stored for that value in the map.

If you need to know how many duplicates there are of, say, the number 2, look up its entry in the table. The number of indices stored there is the total number of occurrences of that value. To find the number of duplicates after a given instance of the value, simply check how many of the stored indices come after the desired index.
