简体   繁体   中英

Finding the K UNIQUE largest elements in an unsorted array of pairs

So here is the scenario. I have an unsorted array (very large) called gallery which contains pairs of templates ( std::vector<uint8_t> ) and their associated IDs ( std::string ).

I have a function in which I am provided with a template and must return the IDs of the k most similar templates in my gallery (I am using cosine similarity to generate a similarity score between the templates).

I considered using a heap as discussed in this post . However, the issue is that gallery can contain multiple different templates which belong to a single ID. In my function, I must return k unique IDs.

For context, I am doing a facial recognition application. I can have multiple different templates belonging to a single person in my gallery (the person was enrolled in the gallery using multiple different images, hence multiple templates belonging to their ID). The search function should return the k most similar people to the provided template (and thus not return the same ID more than once).

Would appreciate an efficient algorithm for doing so in C++.

Edit : Code snipped for my proposed solution with heap (does not deal with duplicates properly)

    std::priority_queue<std::pair<double, std::string>, std::vector<std::pair<double, std::string> >, std::greater<> > queue;


    for(const auto& templPair : m_gallery) {
        try{
            double similairty = computeSimilarityScore(templPair.templ, idTemplateDeserial);

            if (queue.size() < candidateListLength) {
                queue.push(std::pair<double, std::string>(similairty, templPair.id));
            } else if (queue.top().first < similairty) {
                queue.pop();
                queue.push(std::pair<double, std::string>(similairty, templPair.id));
            }
        } catch(...) {
            std::cout << "Unable to compute similarity\n";
            continue;
        }
    }
// CandidateListLength number of IDs with the highest scores will be in queue

Here is an example to hopefully help. For the sake of simplicity, I will assume that the similarity score has already been computed for the templates.

Template 1: similarity score: 0.4, ID: Cyrus

Template 2: similarity score: 0.5, ID: James

Template 3: similarity score: 0.9, ID: Bob

Template 4: similarity score: 0.8, ID: Cyrus

Template 5: similarity score: 0.7, ID: Vanessa

Template 6: similarity score: 0.3, ID: Ariana

Getting the IDs of the top 3 scoring templates will return [Bob, Cyrus, Vanessa]

Use std::set structure (balanced BST) instead of heap. It also puts elements in order, lets you find the largest and the smallest element inserted. In addition, it automatically detects a duplicate when using insert function and ignores it, so each element inside will always be unique. Complexity is exactly the same (it is a bit slower though because of a larger constant).

Edit: I probably did not understand the question properly. As far as I can see you can have multiple elements with different values which should be considered a duplicate.

What I would do:

  1. Make a set with pairs (template value, ID)
  2. Make a map where key is an ID and value is a template value of a template currently in the set.

  3. If you want to add a new template:

    • If it's ID is in the map - you have found a duplicate. If it has worse value than the one paired with the ID in the map, do nothing, otherwise delete a pair (old value, ID) from the set and insert (new value, ID), change value in the map to the new one.
    • If it's not in the map just add it to both map and set.
  4. When you have too many items in the set, just delete the worst one from both set and map.

Implemented the answer outline by Maras. It seems to do the job.

#include <iostream>
#include <vector>
#include <map>
#include <utility>
#include <string>
#include <set>

int main() {
    int K = 3;

    std::vector<std::pair<double, std::string>> data {
        {0.4, "Cyrus"},
        {0.5, "James"},
        {0.9, "Bob"},
        {0.8, "Cyrus"},
        {0.7, "Vanessa"},
        {0.3, "Ariana"},
    };

    std::set<std::pair<double, std::string>> mySet;
    std::map<std::string, double> myMap;

    for (const auto& pair: data) {
        if (myMap.find( pair.second ) == myMap.end()) {
            // The ID is unique
            if (mySet.size() < K) {
                // The size of the set is less than the size of search candidates
                // Add the result to the map and the set
                mySet.insert(pair);
                myMap[pair.second] = pair.first;
            } else {
                // Check to see if the current score is larger than the worst performer in the set
                auto worstPairPtr = mySet.begin();

                if (pair.first > (*worstPairPtr).first) {
                    // The contender performed better than the worst in the set
                    // Remove the worst item from the set, and add the contender
                    // Remove the corresponding item from the map, and add the new contender
                    mySet.erase(worstPairPtr);
                    myMap.erase((*worstPairPtr).second);
                    mySet.insert(pair);
                    myMap[pair.second] = pair.first;
                }
            }

        } else {
            // The ID already exists
            // Compare the contender score to the score of the existing ID.
            // If the contender score is better, replace the existing item score with the new score
            // Remove the old item from the set
            if (pair.first > myMap[pair.second]) {
                mySet.erase({myMap[pair.second], pair.second});
                mySet.insert(pair);
                myMap[pair.second] = pair.first;
            }

        }
    }

    for (auto it = mySet.rbegin(); it != mySet.rend(); ++it) {
        std::cout << (*it).second << std::endl;
    }

}

The output is

Bob
Cyrus
Vanessa

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM