
Compute indices of two vectors' common elements efficiently

I have two vectors (each of them has only unique elements) that share a set of integers. I would like to compute the indices of the elements of one vector that also exist in the other vector as efficiently as possible. Can you outperform my humble, inefficient implementation?

Edit: The vectors are not sorted, and we need the indices into the unsorted vector. Furthermore, it is forbidden to modify the initial vectors (random_vec_1 and random_vec_2) when solving the problem.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <set>
#include <unordered_set>
#include <vector>

using namespace std::chrono;

int main() {

    // Setup 1: Construct two vectors with random integers.
    constexpr size_t num = 1000;

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> dis(0, num);

    std::vector<int> random_vec_1;
    std::vector<int> random_vec_2;
    random_vec_1.reserve(num);
    random_vec_2.reserve(num);
    for (size_t i = 0u; i < num; ++i) {
        random_vec_1.push_back(dis(gen));
        random_vec_2.push_back(dis(gen));
    }
    // Setup 2: Make elements unique and shuffle them.
    std::set<int> s1(random_vec_1.begin(), random_vec_1.end());
    std::set<int> s2(random_vec_2.begin(), random_vec_2.end());
    random_vec_1.assign(s1.begin(), s1.end());
    random_vec_2.assign(s2.begin(), s2.end());
    // std::random_shuffle was removed in C++17; std::shuffle reuses the engine above.
    std::shuffle(random_vec_1.begin(), random_vec_1.end(), gen);
    std::shuffle(random_vec_2.begin(), random_vec_2.end(), gen);


    std::cout << "size random_vec_1: " << random_vec_1.size() << "\n";
    std::cout << "size random_vec_2: " << random_vec_2.size() << "\n";

    auto begin1 = high_resolution_clock::now();

    // Solve problem -------------------------------------------
    std::vector<size_t> match_index_2;
    std::unordered_set<int> my_set(random_vec_1.begin(), random_vec_1.end());
    for (size_t i = 0u; i < random_vec_2.size(); ++i) {
        if (my_set.count(random_vec_2[i]) == 1u)
            match_index_2.push_back(i);
    }
    // ---------------------------------------------------------

    auto end1 = high_resolution_clock::now();
    auto ticks1 = duration_cast<microseconds>(end1-begin1);
    std::cout << "Set approach took " << ticks1.count() << " microseconds.\n";
    std::cout << "Number of common indices: " << match_index_2.size() << "\n";

}

Vectors are so fast nowadays that I would not use a set:

  1. Copy the first vector to, e.g., new_vec_1;
  2. Sort new_vec_1;
  3. Use binary_search to look up each value of the second vector in new_vec_1.

Code:

std::vector<int> new_vec_1(random_vec_1);
std::sort(std::begin(new_vec_1), std::end(new_vec_1));
std::vector<size_t> match_index_2;
match_index_2.reserve(random_vec_2.size());

for (size_t i = 0; i < random_vec_2.size(); ++i) {
    if (std::binary_search(std::begin(new_vec_1), 
                           std::end(new_vec_1),
                           random_vec_2[i])) {
        match_index_2.push_back(i);
    }
}

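See the code on ideone. It is about twice as fast as the set version, and I think it may be optimized further.

Note that this code is asymptotically no better than yours (sorting plus binary search is O(n log n), while the hash set lookup is expected O(n)), but std::vector is so cache-friendly that you get better performance at this size.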


Here is another approach that sorts both vectors and is a bit faster. Note that the indices it records refer to the sorted copy new_vec_2, not to random_vec_2:

std::vector<int> new_vec_1(random_vec_1);
std::vector<int> new_vec_2(random_vec_2);
std::sort(std::begin(new_vec_1), std::end(new_vec_1));
std::sort(std::begin(new_vec_2), std::end(new_vec_2));
std::vector<size_t> match_index_2;
match_index_2.reserve(random_vec_2.size());

for (auto it1 = new_vec_1.begin(), it2 = new_vec_2.begin();
     it1 != new_vec_1.end() && it2 != new_vec_2.end();
     ++it2) {
    while (it1 != new_vec_1.end() && *it1 < *it2) ++it1;
    if (it1 != new_vec_1.end() && *it1 == *it2) {
        match_index_2.push_back(it2 - new_vec_2.begin());
    }
}
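
If positions in the original, unsorted vector are needed, one option is to sort (value, index) pairs instead of bare values. The following is only a sketch under that assumption; the function name common_indices_sorted is hypothetical and not part of the answer above.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns the positions in `b` (original, unsorted order)
// of the values that also occur in `a`. Neither input is modified.
std::vector<std::size_t> common_indices_sorted(const std::vector<int>& a,
                                               const std::vector<int>& b) {
    std::vector<int> sorted_a(a);
    std::sort(sorted_a.begin(), sorted_a.end());

    // Pair each value of b with its original position, then sort by value.
    std::vector<std::pair<int, std::size_t>> tagged_b;
    tagged_b.reserve(b.size());
    for (std::size_t i = 0; i < b.size(); ++i) {
        tagged_b.emplace_back(b[i], i);
    }
    std::sort(tagged_b.begin(), tagged_b.end());

    std::vector<std::size_t> result;
    result.reserve(std::min(a.size(), b.size()));
    auto it_a = sorted_a.begin();
    for (const auto& entry : tagged_b) {
        // Advance through sorted_a until it catches up with the current value.
        while (it_a != sorted_a.end() && *it_a < entry.first) ++it_a;
        if (it_a == sorted_a.end()) break;
        if (*it_a == entry.first) result.push_back(entry.second);
    }
    // Matches come out ordered by value; sort them to get ascending positions.
    std::sort(result.begin(), result.end());
    return result;
}
```

The extra sort of the pairs and of the result costs time, so whether this still beats the hash-based version would have to be measured.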

New answer

The new requirement is that the original vectors cannot be modified when computing the solution. The sorting-intersection solution does not work anymore since the indices are mixed up.

Here is what I suggest: map the values of the vector whose indices you want (random_vec_2 here) to their indices with an unordered_map, and then run through the values of the other vector.

// Not necessary, might increase performance
match_index_2.reserve(std::min(random_vec_1.size(), random_vec_2.size()));

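// Requires #include <unordered_map>; indices are stored as std::size_t to match match_index_2.
std::unordered_map<int, std::size_t> index_map;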
// random_vec_2 is the one from which we want the indices.
index_map.reserve(random_vec_2.size());
for (std::size_t i = 0; i < random_vec_2.size(); ++i) {
    index_map.emplace(random_vec_2[i], i);
}

for (auto& it : random_vec_1) {
    auto found_it = index_map.find(it);
    if (found_it != index_map.end()) {
        match_index_2.push_back(found_it->second);
    }
}
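
For reference, the same idea can be wrapped into a self-contained function. This is a minimal sketch, assuming unique values in each vector as stated in the question; the name common_indices_map and its parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical wrapper around the map-based lookup: returns positions in
// `indexed` of the values that also occur in `probe`. Neither input is modified.
std::vector<std::size_t> common_indices_map(const std::vector<int>& probe,
                                            const std::vector<int>& indexed) {
    // Map each value of `indexed` to its position.
    std::unordered_map<int, std::size_t> index_map;
    index_map.reserve(indexed.size());
    for (std::size_t i = 0; i < indexed.size(); ++i) {
        index_map.emplace(indexed[i], i);
    }

    std::vector<std::size_t> result;
    result.reserve(std::min(probe.size(), indexed.size()));
    for (int value : probe) {
        auto found = index_map.find(value);
        if (found != index_map.end()) {
            result.push_back(found->second);
        }
    }
    return result;  // ordered by appearance in `probe`, not by position
}
```

Calling common_indices_map(random_vec_1, random_vec_2) would give the indices into random_vec_2. Note that they follow the order of random_vec_1, so sort the result if ascending indices are required.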

Also, if the values in your vectors lie within a relatively small range (which is what user2079303 asked about), you can replace the map with a vector, which might further increase performance. In the following, I assume the values are inside the range [0, num], so the table needs num + 1 entries.

match_index_2.reserve(std::min(random_vec_1.size(), random_vec_2.size()));

constexpr std::size_t unmapped = -1; // -1 or another unused index
// Since std::size_t is an unsigned type, -1 will actually be the maximum value it can hold.

// num + 1 entries, because the value num itself is a valid index.
std::vector<std::size_t> index_map(num + 1, unmapped);
for (std::size_t i = 0; i < random_vec_2.size(); ++i) {
    index_map[random_vec_2[i]] = i;
}

for (auto& it : random_vec_1) {
    auto index = index_map[it];
    if (index != unmapped) {
        match_index_2.push_back(index);
    }
}
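
The table lookup is only safe when every value really falls inside the assumed range. The sketch below adds a range check; the function name common_indices_table and the max_value parameter are hypothetical, and skipping out-of-range values is an assumption made for the example, not something the answer specifies.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: direct-address table for values expected to lie in
// [0, max_value]. Values outside that range are skipped rather than indexed.
std::vector<std::size_t> common_indices_table(const std::vector<int>& probe,
                                              const std::vector<int>& indexed,
                                              int max_value) {
    constexpr std::size_t unmapped = std::numeric_limits<std::size_t>::max();
    if (max_value < 0) return {};

    // max_value + 1 slots so that max_value itself is a valid index.
    std::vector<std::size_t> table(static_cast<std::size_t>(max_value) + 1, unmapped);
    auto in_range = [max_value](int v) { return v >= 0 && v <= max_value; };

    for (std::size_t i = 0; i < indexed.size(); ++i) {
        if (in_range(indexed[i])) {
            table[static_cast<std::size_t>(indexed[i])] = i;
        }
    }

    std::vector<std::size_t> result;
    for (int value : probe) {
        if (!in_range(value)) continue;
        const std::size_t index = table[static_cast<std::size_t>(value)];
        if (index != unmapped) result.push_back(index);
    }
    return result;  // ordered by appearance in `probe`
}
```

The table costs memory proportional to the value range, so this only pays off when the range is comparable to the number of elements.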

Previous answer

Since your vectors are already sorted (after using std::set to keep unique elements), you can use this algorithm:

auto first1 = random_vec_1.begin();
auto last1 = random_vec_1.end();
auto first2 = random_vec_2.begin();
auto last2 = random_vec_2.end();
auto index_offset = first1; // Put first2 if you want the indices of the second vector instead

while (first1 != last1 && first2 != last2) {
    if (*first1 < *first2)
        ++first1;
    else if (*first2 < *first1)
        ++first2;
    else {
        match_index_2.push_back(std::distance(index_offset, first1));
        ++first1;
        ++first2;
    }
}

Adapted from the gcc libstdc++ source code for std::set_intersection.

Here is another version, adapted from cppreference:

auto first1 = random_vec_1.begin();
auto last1 = random_vec_1.end();
auto first2 = random_vec_2.begin();
auto last2 = random_vec_2.end();
auto index_offset = first1; // Put first2 if you want the indices of the second vector instead

while (first1 != last1 && first2 != last2) {
    if (*first1 < *first2) {
        ++first1;
    } else  {
        if (!(*first2 < *first1)) {
            match_index_2.push_back(std::distance(index_offset, first1++));
        }
        ++first2;
    }
}

If you want more efficiency, call reserve on match_index_2 beforehand. You can also get rid of the sets by using std::sort and std::unique instead.

// Setup 2: Make elements unique.
auto first1 = random_vec_1.begin();
auto last1 = random_vec_1.end();
std::sort(first1, last1);
last1 = std::unique(first1, last1);
random_vec_1.erase(last1, random_vec_1.end());

auto first2 = random_vec_2.begin();
auto last2 = random_vec_2.end();
std::sort(first2, last2);
last2 = std::unique(first2, last2);
random_vec_2.erase(last2, random_vec_2.end());

You might create indices into the sets of values and operate on these:

#include <algorithm>
#include <numeric>   // std::iota
#include <vector>

inline std::vector<std::size_t>  make_unique_sorted_index(const std::vector<int>& v) {
    std::vector<std::size_t> result(v.size());
    std::iota(result.begin(), result.end(), 0);
    std::sort(result.begin(), result.end(),
        [&v] (std::size_t a, std::size_t b) {
            return v[a] < v[b];
    });
    auto obsolete = std::unique(result.begin(), result.end(),
        [&v] (std::size_t a, std::size_t b) {
            return v[a] == v[b];
    });
    result.erase(obsolete, result.end());
    return result;
}

// Constructs an unordered range of indices [i0, i1, i2, ...iN) into the first set
// for elements that are found uniquely in both sets.
// Note: The sequence [set1[i0], set1[i1], set1[i2], ... set1[iN]) will be sorted.
std::vector<std::size_t>  unordered_set_intersection(
    const std::vector<int>& set1,
    const std::vector<int>& set2)
{
    std::vector<std::size_t> result;
    result.reserve(std::min(set1.size(), set2.size()));
    std::vector<std::size_t> index1 = make_unique_sorted_index(set1);
    std::vector<std::size_t> index2 = make_unique_sorted_index(set2);

    auto i1 = index1.begin();
    auto i2 = index2.begin();
    while(i1 != index1.end() && i2 != index2.end()) {
        if(set1[*i1] < set2[*i2]) ++i1;
        else if(set2[*i2] < set1[*i1]) ++i2;
        else {
            result.push_back(*i1);
            ++i1;
            ++i2;
        }
    }
    result.shrink_to_fit();
    return result;
}

Note: An improvement in performance might be gained by skipping the second index and operating on a copy of the second set.
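
A possible shape for that variant is sketched below: only the first set gets an index, while the second set is copied and sorted directly. The name intersection_indices is hypothetical, and the sketch assumes unique values in each input.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical variant: index-sort set1 only and merge against a sorted copy
// of set2. Returns indices into set1; neither input is modified.
std::vector<std::size_t> intersection_indices(const std::vector<int>& set1,
                                              const std::vector<int>& set2) {
    // Indices into set1, ordered by the values they point to.
    std::vector<std::size_t> index1(set1.size());
    std::iota(index1.begin(), index1.end(), std::size_t{0});
    std::sort(index1.begin(), index1.end(),
              [&set1](std::size_t a, std::size_t b) { return set1[a] < set1[b]; });

    // Plain sorted copy of the second set; its indices are not needed.
    std::vector<int> sorted2(set2);
    std::sort(sorted2.begin(), sorted2.end());

    std::vector<std::size_t> result;
    result.reserve(std::min(set1.size(), set2.size()));
    auto i1 = index1.begin();
    auto i2 = sorted2.begin();
    while (i1 != index1.end() && i2 != sorted2.end()) {
        if (set1[*i1] < *i2) {
            ++i1;
        } else if (*i2 < set1[*i1]) {
            ++i2;
        } else {
            result.push_back(*i1);
            ++i1;
            ++i2;
        }
    }
    return result;  // ordered by value, not by index
}
```

Sorting plain int values avoids the indirection through the comparator for one of the two sorts, which is where any gain would come from.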

Alternatively, make_unique_sorted_index might be replaced by:

inline std::vector<std::size_t>  make_sorted_index(const std::vector<int>& v) {
    std::vector<std::size_t> result(v.size());
    std::iota(result.begin(), result.end(), 0);
    std::sort(result.begin(), result.end(),
        [&v] (std::size_t a, std::size_t b) {
            return v[a] < v[b];
    });
    return result;
}

The algorithm produces well-defined results whether or not the values are unique:

  • The order of the elements (the ones the result indices point to) is only as stable as std::sort, which does not guarantee the relative order of equal elements.
  • If the values are not unique, the number of identical elements (the ones the result indices point to) is the minimum of the number of identical elements in the first and in the second set.

In practice I would expect sorting the vectors to substantially outperform building a std::set, because std::set is a tree, whereas a vector of int can be sorted in linear time with counting sort (linear in the number of elements plus the size of the value range). If you do not count beyond one, counting sort also removes duplicates and so gives you a set. Building the std::set is O(n log n), that is n insertions at O(log n) each, whereas the counting sort is O(n) for a bounded value range.

On the sorted vectors, you can then run std::set_intersection, which runs in time linear in the combined size of the two inputs.

Thus you should be able to do this in linear time.
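
As a rough sketch of that route, the code below de-duplicates with a presence table (counting sort that never counts beyond one) and then uses std::set_intersection to obtain the common values. The names counting_unique and common_values and the max_value parameter are hypothetical; the sketch assumes max_value is non-negative, reasonably small, and that every value lies in [0, max_value].

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: sorted, de-duplicated copy of `v`.
// Assumes 0 <= value <= max_value for every element. Runs in O(n + max_value).
std::vector<int> counting_unique(const std::vector<int>& v, int max_value) {
    std::vector<bool> seen(static_cast<std::size_t>(max_value) + 1, false);
    for (int value : v) {
        seen[static_cast<std::size_t>(value)] = true;
    }
    std::vector<int> result;
    for (int value = 0; value <= max_value; ++value) {
        if (seen[static_cast<std::size_t>(value)]) result.push_back(value);
    }
    return result;
}

// Values present in both inputs, in ascending order. The inputs are not modified.
std::vector<int> common_values(const std::vector<int>& a,
                               const std::vector<int>& b,
                               int max_value) {
    const std::vector<int> sorted_a = counting_unique(a, max_value);
    const std::vector<int> sorted_b = counting_unique(b, max_value);
    std::vector<int> common;
    std::set_intersection(sorted_a.begin(), sorted_a.end(),
                          sorted_b.begin(), sorted_b.end(),
                          std::back_inserter(common));
    return common;
}
```

This yields values rather than indices, so a further step is needed to get back to positions in the original vectors.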

If you cannot modify the vectors, you can use a hash map (std::unordered_map) to map values to their indices in the original vector. Since the numbers might not be unique, a value may occur at several positions: first determine the values x_1, ..., x_n that are contained in both sets, then use the hash map to project those values back to indices in your original vector.
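
A minimal sketch of that projection step follows. It assumes the common values have already been computed by some other means; the name indices_of_common is hypothetical, and storing a list of positions per value is one way (an assumption here) to cope with duplicates.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical helper: for each value in `common_values`, report every
// position it occupies in `original`. `original` is not modified.
std::vector<std::size_t> indices_of_common(const std::vector<int>& original,
                                           const std::vector<int>& common_values) {
    // value -> all positions of that value in the original vector
    std::unordered_map<int, std::vector<std::size_t>> positions;
    for (std::size_t i = 0; i < original.size(); ++i) {
        positions[original[i]].push_back(i);
    }

    std::vector<std::size_t> result;
    for (int value : common_values) {
        auto found = positions.find(value);
        if (found == positions.end()) continue;
        result.insert(result.end(), found->second.begin(), found->second.end());
    }
    return result;  // grouped by value, in the order of `common_values`
}
```

If the values are known to be unique, a plain std::unordered_map<int, std::size_t> is enough and avoids the per-value vectors.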
