
How to parallelize nearest neighbour search using OpenMP

Basically, I have a collection std::vector<std::pair<std::vector<float>, unsigned int>> which contains pairs of templates (std::vector<float> of size 512, i.e. 2048 bytes) and their corresponding identifiers (unsigned int).

I am writing a function that is given a template and must return the identifier of the most similar template in the collection. I am using the dot product to compute the similarity.

My naive implementation looks as follows:

// Should return false if no match is found (i.e. the similarity is 0 for all templates in the collection)
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    bool found = false;
    similarity = 0.f;

    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float cosineSimilarity = getSimilarity(data, candidateTemplate, length); // computes cosine similarity between two vectors; implementation depends on architecture.

        if (cosineSimilarity > similarity) {
            found = true;
            similarity = cosineSimilarity;
            label = collection[i].second;
        }
    }

    return found;
}

How can I speed this up using parallelization? My collection can potentially contain millions of templates. I have read that you can add #pragma omp parallel for reduction, but I am not entirely sure how to use it (or whether it is even the best option).

Also note: for my dot product implementation, if the underlying architecture supports AVX & FMA, I am using that. Will this affect performance when we parallelize, since there are only a limited number of SIMD registers?
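For reference, a simplified scalar sketch of what getSimilarity computes (assuming the templates are L2-normalised so that the plain dot product equals the cosine similarity — that normalisation assumption is mine; the AVX/FMA version follows the same logic with intrinsics):

```cpp
// Scalar sketch of getSimilarity: plain dot product.
// Assumes both templates are L2-normalised, so the dot product
// equals the cosine similarity (assumption, not the real code).
float getSimilarity(const float* a, const float* b, unsigned int length) {
    float sum = 0.f;
    for (unsigned int i = 0; i < length; ++i)
        sum += a[i] * b[i];
    return sum;
}
```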

Since we don't have access to an example that actually compiles (which would have been nice), I haven't actually tried to compile the code below. Nevertheless, minor typos aside, the general idea should be clear.

The task is to find the highest similarity value and the corresponding label. For this we can indeed use reduction, but since we need to maximise one value while keeping track of the corresponding label, we use a std::pair to hold both values at once and implement the reduction over that pair in OpenMP.

I have slightly rewritten your code, possibly making it a bit harder to read by keeping the original name (temp) for the variable. Basically, each thread searches its share of the collection and finds its own best candidate; we then ask OpenMP to combine the per-thread results (the reduction) and we are done.

// Reduce by finding the maximum; we also need to keep the corresponding label, which is why we use a std::pair.
void reduce_custom (std::pair<float, unsigned int>& output, const std::pair<float, unsigned int>& input) {
    if (input.first > output.first) output = input;
}
//Declare an OpenMP reduction with our pair and our custom reduction function. 
#pragma omp declare reduction(custom_reduction : \
    std::pair<float, unsigned int>: \
    reduce_custom(omp_out, omp_in)) \
    initializer(omp_priv(omp_orig))

bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    std::pair<float, unsigned int> temp(0.f, label); // Thread-local best similarity and its corresponding label.

#pragma omp parallel for reduction(custom_reduction:temp)
    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data(); 
        float cosineSimilarity = getSimilarity(data, candidateTemplate, length);

        if (cosineSimilarity > temp.first) {
            temp.first = cosineSimilarity;
            temp.second = collection[i].second;
        }
    }

    if (temp.first > 0.f) {
        similarity = temp.first;
        label = temp.second;
        return true;
    }

    return false;
}
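If your compiler does not support OpenMP 4.0 user-defined reductions (MSVC, for example, only implements OpenMP 2.0), the same result can be obtained with per-thread locals and a single critical section. This is a sketch under the question's assumed shapes of collection and getSimilarity, with stand-ins defined so it compiles on its own; it also runs correctly (just serially) if OpenMP is disabled:

```cpp
#include <utility>
#include <vector>

// Stand-ins for the question's globals (assumed shapes, not the real code).
static std::vector<std::pair<std::vector<float>, unsigned int>> collection;

static float getSimilarity(const float* a, const float* b, unsigned int n) {
    float s = 0.f;
    for (unsigned int i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Same search as above, but without "declare reduction": each thread keeps
// its own best candidate and the results are merged once per thread under
// a critical section.
bool identify(const float* data, unsigned int length,
              unsigned int& label, float& similarity) {
    float bestSim = 0.f;
    unsigned int bestLabel = 0;

    #pragma omp parallel
    {
        float localSim = 0.f;
        unsigned int localLabel = 0;

        #pragma omp for nowait
        for (long i = 0; i < (long)collection.size(); ++i) {
            float s = getSimilarity(data, collection[i].first.data(), length);
            if (s > localSim) {
                localSim = s;
                localLabel = collection[i].second;
            }
        }

        #pragma omp critical
        {
            if (localSim > bestSim) {
                bestSim = localSim;
                bestLabel = localLabel;
            }
        }
    }

    if (bestSim > 0.f) {
        similarity = bestSim;
        label = bestLabel;
        return true;
    }
    return false;
}
```

The signed loop index (long rather than size_t) is deliberate: OpenMP 2.0 requires a signed integer loop variable.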

Regarding your concern about the limited number of SIMD registers: their number depends on the specific CPU you are using. To the best of my understanding, each core has its own set of vector registers, so as long as you were not oversubscribing them before, you should be fine now as well. Besides, AVX-512 for instance provides 32 vector registers and 2 arithmetic units for vector operations per core, so running out of compute resources is not that easy; you are more likely to suffer from poor memory locality (particularly in your case, with the vectors scattered all over the heap). I might of course be wrong; if so, please feel free to correct me in the comments.
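On the locality point: with millions of separate std::vector<float> allocations, every similarity computation chases a pointer to a different heap block. One common remedy (a sketch with names of my own choosing, not code from the question) is to pack all templates back-to-back in one contiguous buffer and keep the labels in a parallel array, so a linear scan streams through memory:

```cpp
#include <cstddef>
#include <vector>

// Contiguous storage for fixed-size templates: all template data lives in
// one buffer, labels in a parallel array. Indexing by i * dim replaces a
// pointer dereference into a separate allocation.
struct TemplateStore {
    std::size_t dim = 512;             // floats per template
    std::vector<float> data;           // dim * size() floats, back-to-back
    std::vector<unsigned int> labels;  // one label per template

    void add(const std::vector<float>& t, unsigned int label) {
        data.insert(data.end(), t.begin(), t.end());
        labels.push_back(label);
    }
    const float* templ(std::size_t i) const { return data.data() + i * dim; }
    std::size_t size() const { return labels.size(); }
};
```

The search loop then iterates i from 0 to size(), calling getSimilarity(query, store.templ(i), dim), which prefetches well because consecutive templates are adjacent in memory.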
