
Where should I put ANNOTATE_ITERATION_TASK?

I'm using Intel Advisor to analyze my parallel application. This is the main loop of my program, where most of the time is spent:

   for(size_t i=0; i<wrapperIndexes.size(); i++){
       const int r = wrapperIndexes[i].r;
       const int c = wrapperIndexes[i].c;
       const float val = localWrappers[wrapperIndexes[i].i].cur.at<float>(wrapperIndexes[i].r,wrapperIndexes[i].c);
       if ( (val > positiveThreshold && (isMax(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].high, r, c))) ||
            (val < negativeThreshold && (isMin(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].high, r, c))) )
          // either positive -> local max. or negative -> local min.
            ANNOTATE_ITERATION_TASK(localizeKeypoint);
            localizeKeypoint(r, c, localCurSigma[wrapperIndexes[i].i], localPixelDistances[wrapperIndexes[i].i], localWrappers[wrapperIndexes[i].i]);
   }

As you can see, localizeKeypoint is where the loop spends most of its time (if you don't count the if clause). I want to produce a Suitability Report to estimate the gain from parallelizing the loop above, so I've written this:

   ANNOTATE_SITE_BEGIN(solve);
   for(size_t i=0; i<wrapperIndexes.size(); i++){
       const int r = wrapperIndexes[i].r;
       const int c = wrapperIndexes[i].c;
       const float val = localWrappers[wrapperIndexes[i].i].cur.at<float>(wrapperIndexes[i].r,wrapperIndexes[i].c);
       if ( (val > positiveThreshold && (isMax(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].high, r, c))) ||
            (val < negativeThreshold && (isMin(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].high, r, c))) )
          // either positive -> local max. or negative -> local min.
            ANNOTATE_ITERATION_TASK(localizeKeypoint);
            localizeKeypoint(r, c, localCurSigma[wrapperIndexes[i].i], localPixelDistances[wrapperIndexes[i].i], localWrappers[wrapperIndexes[i].i]);
   }
   ANNOTATE_SITE_END();

The Suitability Report gives an excellent 6.69x gain, as you can see here:

[Image: Suitability Report showing the 6.69x gain]

However, when I run the Dependencies check, I get this problem message:

[Image: problem message from the Dependencies check]

In particular, see "Missing start task".

In addition, if I place ANNOTATE_ITERATION_TASK at the beginning of the loop, like this:

   ANNOTATE_SITE_BEGIN(solve);
   for(size_t i=0; i<wrapperIndexes.size(); i++){
       ANNOTATE_ITERATION_TASK(localizeKeypoint);
       const int r = wrapperIndexes[i].r;
       const int c = wrapperIndexes[i].c;
       const float val = localWrappers[wrapperIndexes[i].i].cur.at<float>(wrapperIndexes[i].r,wrapperIndexes[i].c);
       if ( (val > positiveThreshold && (isMax(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMax(val, localWrappers[wrapperIndexes[i].i].high, r, c))) ||
            (val < negativeThreshold && (isMin(val, localWrappers[wrapperIndexes[i].i].cur, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].low, r, c) && isMin(val, localWrappers[wrapperIndexes[i].i].high, r, c))) )
          // either positive -> local max. or negative -> local min.
            localizeKeypoint(r, c, localCurSigma[wrapperIndexes[i].i], localPixelDistances[wrapperIndexes[i].i], localWrappers[wrapperIndexes[i].i]);
   }
   ANNOTATE_SITE_END();

The gain is horrible:

[Image: Suitability Report with ANNOTATE_ITERATION_TASK at the beginning of the loop]

Am I doing something wrong?

INTEL_OPT=-O3 -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl 

You have to use the second approach, where ANNOTATE_ITERATION_TASK is placed at the very beginning of the loop body. Otherwise you get (a) a wrong performance projection in Suitability and (b) the "Missing start task" problem in Correctness.

If you run Correctness on the second variant (with the iteration task at the very beginning of the loop body), Correctness should be OK.
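The placement can be sketched on a smaller loop. This is only an illustration, not the code from the question: processItem, annotatedLoop and the threshold are hypothetical stand-ins for the localizeKeypoint logic. The annotation macros are assumed to come from Intel Advisor's advisor-annotate.h; the no-op definitions below are placeholders so the sketch compiles without Advisor and are not the real macro definitions.

```cpp
#include <cstddef>
#include <vector>

// With Intel Advisor available, include its header instead:
//   #include "advisor-annotate.h"
// ASSUMPTION: the no-op stand-ins below exist only so this sketch
// compiles on its own; they are not the real Advisor definitions.
#ifndef ANNOTATE_SITE_BEGIN
#define ANNOTATE_SITE_BEGIN(site)      ((void)0)
#define ANNOTATE_SITE_END()            ((void)0)
#define ANNOTATE_ITERATION_TASK(task)  ((void)0)
#endif

// Hypothetical stand-in for the expensive per-item work.
static int processItem(int value) { return value * 2; }

// Writes processItem(v) for every value above the threshold, 0 otherwise.
std::vector<int> annotatedLoop(const std::vector<int>& values, int threshold) {
    std::vector<int> results(values.size(), 0);
    ANNOTATE_SITE_BEGIN(solve);
    for (std::size_t i = 0; i < values.size(); i++) {
        // First statement of the loop body: the whole iteration,
        // including the filtering condition, is modeled as one task.
        ANNOTATE_ITERATION_TASK(localizeKeypoint);
        if (values[i] > threshold) {
            results[i] = processItem(values[i]);
        }
    }
    ANNOTATE_SITE_END();
    return results;
}
```

Each iteration writes only to its own results[i], so the iterations in this sketch do not share any written state.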

Your second Suitability chart is not horrible. It just says that you have to take care of task chunking (click the "chunking" link in the tool to learn more about it). Fortunately, in recent OpenMP implementations chunking is "good enough" by default; see https://software.intel.com/en-us/articles/openmp-loop-scheduling . So to see the Advisor projection with chunking enabled, you just need to tick the corresponding check-box, and the result will not be that bad.
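As a rough illustration of what chunking means once the loop is actually parallelized with OpenMP, here is a minimal sketch. The names doWork and chunkedLoop are hypothetical and not from the question. With schedule(static) and no chunk size, OpenMP divides the iterations into approximately equal contiguous blocks, at most one per thread, so each thread receives many iterations at once rather than one task per iteration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the expensive per-item work.
static int doWork(int value) { return value * 2; }

// Same filtering loop as a chunked OpenMP loop.
std::vector<int> chunkedLoop(const std::vector<int>& values, int threshold) {
    std::vector<int> results(values.size(), 0);
    const long n = static_cast<long>(values.size());
    // Needs OpenMP enabled at compile time (e.g. -qopenmp or -fopenmp);
    // without it the pragma is ignored and the loop runs serially
    // with the same result.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        const std::size_t idx = static_cast<std::size_t>(i);
        // Each iteration writes only its own slot, so there is no data race.
        if (values[idx] > threshold) {
            results[idx] = doWork(values[idx]);
        }
    }
    return results;
}
```

This is only safe when the iterations are independent, which is what the Correctness analysis is meant to confirm for the real loop.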

