Converting a parallel program from OpenMP to OpenCL

I am wondering how to convert the following OpenMP program to an OpenCL program.

The parallel section of the algorithm, implemented with OpenMP, looks like this:

#pragma omp parallel
  {
    int thread_id = omp_get_thread_num();

    //double mt_probThreshold = mt_nProbThreshold_;
    double mt_probThreshold = nProbThreshold;

    int mt_nMaxCandidate = mt_nMaxCandidate_;
    double mt_nMinProb = mt_nMinProb_;

    int has_next = 1;
    std::list<ScrBox3d> mt_detected;
    ScrBox3d  sample;
    while(has_next) {
#pragma omp critical
      {  // '{' is very important and defines the block of code that needs the lock.
        // Don't remove this pair of '{' and '}'.
        if(piter_ == box_.end()) {
          has_next = 0;
        } else {
          sample = *piter_;
          ++piter_;
        }
      }  // '}' is very important and defines the block of code that needs the lock.

      if(has_next){
        this->SetSample(&sample, thread_id);
        //UpdateSample(sample, thread_id); // May be necessary for more sophisticated features
        sample._prob = (float)this->Prob( true, thread_id, mt_probThreshold);
        //sample._prob = (float)_clf->LogLikelihood( thread_id);
        InsertCandidate( mt_detected, sample, mt_probThreshold, mt_nMaxCandidate, mt_nMinProb );
      }
    }  // end of while(has_next)

#pragma omp critical
    {  // '{' is very important and defines the block of code that needs the lock.
      // Don't remove this pair of '{' and '}'.
      if(mt_detected_.size()==0) {
        mt_detected_    = mt_detected;
        //mt_nProbThreshold_  = mt_probThreshold;
        nProbThreshold = mt_probThreshold;
      } else {
        for(std::list<ScrBox3d>::iterator it = mt_detected.begin();
            it!=mt_detected.end(); ++it)
          InsertCandidate( mt_detected_, *it, /*mt_nProbThreshold_*/nProbThreshold,
                           mt_nMaxCandidate_, mt_nMinProb_ );
      }
    }  // '}' is very important and defines the block of code that needs the lock.
  }//parallel section end

My question is: can this section be implemented with OpenCL? I followed a series of OpenCL tutorials and understood how it works. I was writing the code in .cu files (I had previously installed the CUDA toolkit), but this case is more complicated because it involves many header files, template classes, and object-oriented programming.

How can I convert this OpenMP section to OpenCL? Should I create a new .cu file?

Any advice would help. Thanks in advance.

Edit:

Using the VS profiler I noticed that most of the execution time is spent in the InsertCandidate() function, so I am thinking about writing a kernel to execute this function on the GPU. The most expensive operation in this function is a for loop. But as can be seen, each iteration of the loop contains 3 if statements, which can lead to divergence and therefore serialization, even when executed on the GPU.

for( iter = detected.begin(); iter != detected.end(); iter++ )
    {
        if( nCandidate == nMaxCandidate-1 )
            nProbThreshold = iter->_prob;

        if( box._prob >= iter->_prob )
            break;
        if( nCandidate >= nMaxCandidate && box._prob <= nMinProb )
            break;
        nCandidate ++;
    }

In conclusion, can this program be converted to OpenCL?

It may be possible to convert your sample code to OpenCL; however, I spotted a couple of issues with doing so.

  1. There doesn't seem to be much parallel execution to begin with. More workers may not help at all.
  2. Adding work to process during execution is a fairly recent feature in OpenCL. You would have to either use OpenCL 2.0, or know in advance how much work will be added and pre-allocate memory to store the new data structures. The calls to InsertCandidate may be the part which "can't" be converted to OpenCL.
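
To illustrate the "know in advance and pre-allocate" option from point 2, here is a minimal C++ sketch. It assumes the boxes can be copied out of the list into a flat array and that scoring one box does not depend on any other box. `Box`, `score_box`, `flatten` and `score_all` are hypothetical stand-ins, not the original `ScrBox3d` or `Prob()`. The body of the loop in `score_all` is roughly what one OpenCL work item would do, with `i` playing the role of `get_global_id(0)`.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-in for ScrBox3d: plain data only, so it could be
// copied into a device buffer as-is.
struct Box {
    float x, y, z;
    float prob;
};

// Hypothetical stand-in for Prob(): any pure function of a single box.
inline float score_box(const Box& b) {
    return b.x + b.y + b.z;
}

// Flatten the list once on the host; a device cannot walk a std::list.
inline std::vector<Box> flatten(const std::list<Box>& boxes) {
    return std::vector<Box>(boxes.begin(), boxes.end());
}

// One output slot per input box, allocated up front. Nothing is inserted
// and no lock is taken while the scores are being computed.
inline std::vector<float> score_all(const std::vector<Box>& boxes) {
    std::vector<float> probs(boxes.size());
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        probs[i] = score_box(boxes[i]);  // independent per index i
    }
    return probs;
}
```

With the scores in a flat array, the InsertCandidate calls could then run on the host in an ordinary serial loop over `probs`, which keeps the list manipulation out of the kernel.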

If the function is large enough, you may be able to port the calls to this->Prob(...) instead. You need to be able to cache up a bunch of calls by storing the parameters in a suitable data structure. By 'a bunch' I mean at least hundreds, but ideally thousands or more. Again, this is only worth it if this->Prob() is constant for all calls, and complex enough to be worth the round trip to the OpenCL device and back.
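
A rough C++ sketch of what "caching up" the calls could look like. Everything here is an assumption made for illustration: `ProbArgs`, `run_batch` and `ProbBatcher` are hypothetical names, the two fields are placeholders for whatever `Prob()` really needs, and `run_batch` is a host-side stand-in for a single kernel launch over all queued items.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameters for one Prob() call.
struct ProbArgs {
    float feature;
    float threshold;
};

// Stand-in for the device side: evaluates every queued call in one pass.
// In a real port this would be one kernel launch over args.size() work items.
inline std::vector<float> run_batch(const std::vector<ProbArgs>& args) {
    std::vector<float> out(args.size());
    for (std::size_t i = 0; i < args.size(); ++i) {
        out[i] = args[i].feature >= args[i].threshold ? args[i].feature : 0.0f;
    }
    return out;
}

class ProbBatcher {
public:
    explicit ProbBatcher(std::size_t min_batch) : min_batch_(min_batch) {}

    // Queue one call. Returns true once enough calls have accumulated
    // to be worth dispatching.
    bool add(const ProbArgs& a) {
        pending_.push_back(a);
        return pending_.size() >= min_batch_;
    }

    // Dispatch everything queued so far and clear the queue.
    std::vector<float> flush() {
        std::vector<float> results = run_batch(pending_);
        pending_.clear();
        return results;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t min_batch_;
    std::vector<ProbArgs> pending_;
};
```

In practice `min_batch` would be in the hundreds or thousands, as suggested above, so that the transfer to the device and back is paid once per batch and not once per call.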
