
for loop with openMP runs slower than serial code

I have a piece of code which I am trying to run in parallel; however, for some reason it does not speed up.

The code performs matching between a set of newly found keypoints and previously found keypoints, and then RANSAC is performed for 500 iterations.

Since each pair works independently of the others, I would expect that doing the matching and RANSAC in parallel would increase speed quite a lot.

Here is the code:

Eigen::VectorXf RigidChainTracker::GPUgetCamera(cv::Mat depthNew, cv::Mat colorNew, cv::Mat grayNew, deque<imgPair> oldImg, const Eigen::VectorXf &XiInit,
    vector<cv::KeyPoint> foundPtsNew, cv::Mat descriptorNew, const float alpha){

    vector<cv::KeyPoint> matchedOld[CHAINLENGTH];
    vector<cv::KeyPoint> matchedNew[CHAINLENGTH];

    FindFeatures::FindFeatures fFeatures[CHAINLENGTH];
    for (int i = 0; i < CHAINLENGTH; ++i){
        fFeatures[i] = FindFeatures::FindFeatures(METHOD, fx, fy, cx, cy);
    }
    double t1 = omp_get_wtime(); 
    int i = 0;
    int chainlength = CHAINLENGTH;
    vector<cv::KeyPoint> keyPtsNew[CHAINLENGTH];
    vector<cv::KeyPoint> keyPtsOld[CHAINLENGTH]; 
    const float THRESHOLD = 0.02;
    Eigen::Matrix3f Rrel[CHAINLENGTH];
    Eigen::Vector3f trel[CHAINLENGTH];

#pragma omp parallel for private (i) shared(fFeatures,matchedNew, matchedOld, depthNew, colorNew, grayNew, oldImg,descriptorNew,foundPtsNew, keyPtsNew, keyPtsOld,Rrel,trel, chainlength, THRESHOLD) 
    for (i = 0; i < chainlength; ++i){
        if (i < oldImg.size()){
            fFeatures[i].MatchFeatures(depthNew, oldImg[i].depth, descriptorNew, oldImg[i].descriptor, foundPtsNew, oldImg[i].foundPts, keyPtsNew[i], keyPtsOld[i]);
            fFeatures[i].RANSAC3D(depthNew, oldImg[i].depth, keyPtsNew[i], keyPtsOld[i],  Rrel[i], trel[i],THRESHOLD);


            matchedNew[i] = keyPtsNew[i]; 
            matchedOld[i] = keyPtsOld[i];
        } 
    }

When running this in serial, it runs at about 2-5 Hz, but with OpenMP it is slightly slower. I have tried a few different things, but I cannot get it right. Could something strange be happening when several threads read from the same memory, e.g. from depthNew or descriptorNew? I write to keyPtsNew, keyPtsOld, matchedNew, matchedOld, Rrel, and trel in MatchFeatures and RANSAC3D. From depthNew and descriptorNew I only read. Is it really possible that the serial code is faster?

I have confirmed that multiple threads are executing and that the OpenMP flag in Visual Studio is enabled. :)

I have tried the suggestion by Avi Ginsburg, and it speeds up the parallel part a bit, but the serial code is still faster.

I timed the two functions MatchFeatures and RANSAC3D. When running in serial, each task takes about 0.05 seconds at most.

When running in parallel, each task takes between 0.1 and 0.15 seconds, which is considerably slower. I am trying to figure out whether OpenCV performs some parallelization of its own that I do not know about, for example in the matching process.

/ Erik

Erik, it is POSSIBLE for serial code to run faster than a parallel version, due to cache and pipelining optimizations performed by the compiler.

In your case that seems likely, given the large number of variables shared between threads. Shared variables carry a synchronization overhead that compromises performance. I believe this can be fixed by breaking the OMP loop into two sections: the first performs the calculations and creates a new array of results; the second loops through the old and new arrays performing the match.

In parallel programming, the cost of moving values between memory, cache, and registers is often so large, due to device latency and bandwidth limits, that it can be worth simply copying the data into each thread's cache, computing on it, and then reducing the results, instead of continuously sharing it among threads and processes to avoid duplication. This means that code sections which are less optimized for memory use can actually improve performance.

What you have appears to be a case of false sharing. In short, that means one thread's cached data is invalidated by another thread writing to an object that sits on the same cache line. The simplest way to solve this (IMO) is to rewrite the function as:

Eigen::VectorXf RigidChainTracker::GPUgetCamera(cv::Mat depthNew, cv::Mat colorNew, cv::Mat grayNew, deque<imgPair> oldImg, const Eigen::VectorXf &XiInit,
    vector<cv::KeyPoint> foundPtsNew, cv::Mat descriptorNew, const float alpha)
    {

    vector<cv::KeyPoint> matchedOld[CHAINLENGTH];
    vector<cv::KeyPoint> matchedNew[CHAINLENGTH];

    // Use thread-local variables to avoid conflicts
    #pragma omp parallel shared(depthNew, colorNew, grayNew, oldImg, descriptorNew, foundPtsNew)
    {
        vector<cv::KeyPoint> matchedOld_local[CHAINLENGTH];
        vector<cv::KeyPoint> matchedNew_local[CHAINLENGTH];
        bool processed[CHAINLENGTH] = { false };  // indices this thread handled

        FindFeatures::FindFeatures fFeatures[CHAINLENGTH];
        for (int i = 0; i < CHAINLENGTH; ++i){
            fFeatures[i] = FindFeatures::FindFeatures(METHOD, fx, fy, cx, cy);
        }
        double t1 = omp_get_wtime();
        int chainlength = CHAINLENGTH;
        vector<cv::KeyPoint> keyPtsNew[CHAINLENGTH];
        vector<cv::KeyPoint> keyPtsOld[CHAINLENGTH];
        const float THRESHOLD = 0.02f;
        Eigen::Matrix3f Rrel[CHAINLENGTH];
        Eigen::Vector3f trel[CHAINLENGTH];

    #pragma omp for
        for (int i = 0; i < chainlength; ++i)
        {
            if (i < (int)oldImg.size())
            {
                fFeatures[i].MatchFeatures(depthNew, oldImg[i].depth, descriptorNew, oldImg[i].descriptor, foundPtsNew, oldImg[i].foundPts, keyPtsNew[i], keyPtsOld[i]);
                fFeatures[i].RANSAC3D(depthNew, oldImg[i].depth, keyPtsNew[i], keyPtsOld[i], Rrel[i], trel[i], THRESHOLD);

                matchedNew_local[i] = keyPtsNew[i];
                matchedOld_local[i] = keyPtsOld[i];
                processed[i] = true;
            }
        }

    // Copy to the global variables here. Only copy the indices this thread
    // actually processed, so one thread's empty locals do not overwrite
    // another thread's results
    #pragma omp critical
        {
            for (int i = 0; i < chainlength; ++i)
            {
                if (processed[i])
                {
                    matchedNew[i] = matchedNew_local[i];
                    matchedOld[i] = matchedOld_local[i];
                }
            }
        }
    }

Note that if any of your shared variables is changed inside the for loop, all threads have to discard their cached copy until the updated version is fetched.

Also, since the only modified parameters I saw were matchedNew and matchedOld, those were the only ones I declared thread-local vs. global. If there are others (fFeatures?) that need to be used by the master thread later or returned, declare local/global versions of them as well and copy them in the critical section.
