
How to parallelize nearest neighbour search using OpenMP

Basically, I have a collection std::vector<std::pair<std::vector<float>, unsigned int>> which contains pairs of templates std::vector<float> of size 512 (2048 bytes) and their corresponding identifier unsigned int.

I am writing a function in which I am provided with a template and I need to return the identifier of the most similar template in the collection. I am using the dot product to compute the similarity.
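For clarity, the snippets below assume declarations roughly along these lines; this is only a sketch of the missing context (the actual getSimilarity is architecture-dependent, as noted later):

#include <vector>
#include <utility>

// Sketch of the assumed context (not shown in the question):
// the collection of (template, identifier) pairs, populated elsewhere,
// and a plain scalar fallback for getSimilarity. If the templates are
// L2-normalized, this dot product equals the cosine similarity.
std::vector<std::pair<std::vector<float>, unsigned int>> collection;

float getSimilarity(const float* a, const float* b, unsigned int length) {
    float sum = 0.f;
    for (unsigned int i = 0; i < length; ++i)
        sum += a[i] * b[i];
    return sum;
}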

My naive implementation looks as follows:

// Should return false if no match is found (i.e. similarity is 0 for all templates in the collection)
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    bool found = false;
    similarity = 0.f;

    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float cosineSimilarity = getSimilarity(data, candidateTemplate, length); // computes cosine similarity between two vectors, implementation depends on architecture.

        if (cosineSimilarity > similarity) {
            found = true;
            similarity = cosineSimilarity;
            label = collection[i].second;
        }
    }

    return found;
}

How can I speed this up using parallelization? My collection can potentially contain millions of templates. I have read that you can add #pragma omp parallel for reduction, but I am not entirely sure how to use it (and whether this is even the best option).

Also note: for my dot product implementation, if the base architecture supports AVX & FMA, I am using this implementation. Will this affect performance when we parallelize, since there is only a limited number of SIMD registers?
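For reference, an AVX + FMA dot product over 512-float templates typically looks something like the sketch below. This is only an illustration of the kind of kernel meant here, not the implementation linked in the question; it assumes length is a multiple of 8 and that the code is compiled with AVX/FMA enabled (e.g. -mavx -mfma):

#include <immintrin.h>

// Hypothetical AVX + FMA dot product kernel (assumes length % 8 == 0).
float dotProductAvxFma(const float* a, const float* b, unsigned int length) {
    __m256 acc = _mm256_setzero_ps();
    for (unsigned int i = 0; i < length; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc); // acc += va * vb
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 sum = _mm_add_ps(lo, hi);
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}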

Since we don't have access to an example that actually compiles (which would have been nice), I didn't actually try to compile the example below. Nevertheless, some minor typos (maybe) aside, the general idea should be clear.

The task is to find the highest value of similarity and the corresponding label. For this we can indeed use reduction, but since we need to find the maximum of one value and then store the corresponding label, we make use of a pair to store both values at once, in order to implement this as a reduction in OpenMP.

I have slightly rewritten your code, possibly making things a bit harder to read by keeping the original naming (temp) of the variable. Basically, we perform the search in parallel, so each thread finds an optimal value; we then ask OpenMP to find the optimal solution between the threads (reduction) and we are done.

// Reduce by finding the maximum and also storing the corresponding label; this is why we use a std::pair.
void reduce_custom(std::pair<float, unsigned int>& output, const std::pair<float, unsigned int>& input) {
    if (input.first > output.first) output = input;
}

// Declare an OpenMP reduction over our pair type using the custom reduction function.
// Each thread's private copy (omp_priv) is initialized as a copy of the original value (omp_orig).
#pragma omp declare reduction(custom_reduction : \
    std::pair<float, unsigned int>: \
    reduce_custom(omp_out, omp_in)) \
    initializer(omp_priv(omp_orig))

bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    std::pair<float, unsigned int> temp(0.f, label); // Stores the thread-local best similarity and the corresponding label.

#pragma omp parallel for reduction(custom_reduction : temp)
    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float cosineSimilarity = getSimilarity(data, candidateTemplate, length);

        if (cosineSimilarity > temp.first) {
            temp.first = cosineSimilarity;
            temp.second = collection[i].second;
        }
    }

    if (temp.first > 0.f) {
        similarity = temp.first;
        label = temp.second;
        return true;
    }

    return false;
}
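As a usage sketch (the driver below is my own addition, not part of the original answer, and assumes collection has already been filled), this is how the function could be called. Note that user-defined reductions require OpenMP 4.0 or newer, so the code needs to be built with an OpenMP-4.0-capable compiler (e.g. g++ -O3 -fopenmp):

// Hypothetical driver, requires <vector> and <cstdio>.
std::vector<float> query(512, 0.1f); // placeholder query template
unsigned int bestLabel = 0;
float bestSimilarity = 0.f;

if (identify(query.data(), static_cast<unsigned int>(query.size()), bestLabel, bestSimilarity)) {
    std::printf("best label: %u (similarity %.4f)\n", bestLabel, bestSimilarity);
} else {
    std::printf("no match found\n");
}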

Regarding your concern about the limited number of SIMD registers, their number depends on the specific CPU you are using. To the best of my understanding, each core has a set number of vector registers available, so as long as you were not using more than were available before, it should be fine now as well. Besides, AVX512 for instance provides 32 vector registers and 2 arithmetic units for vector operations per core, so running out of compute resources is not trivial; you are more likely to suffer from poor memory locality (particularly in your case, with the vectors being scattered all over the place). I might of course be wrong; if so, please feel free to correct me in the comments.
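If that locality point turns out to matter, one common remedy (again a sketch on my part, not something from the question) is to store all templates contiguously in a single flat buffer with a parallel array of labels, so the inner loop streams through memory sequentially instead of chasing one heap allocation per std::vector<float>:

#include <vector>
#include <cstddef>

// Hypothetical flat layout: one contiguous buffer of N * 512 floats plus a parallel label array.
struct FlatCollection {
    std::vector<float> data;            // templates stored back to back, 512 floats each
    std::vector<unsigned int> labels;   // labels[i] corresponds to data[i * 512 .. i * 512 + 511]
    static constexpr unsigned int templateLength = 512;

    const float* templateAt(size_t i) const { return data.data() + i * templateLength; }
    size_t size() const { return labels.size(); }
};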
