简体   繁体   中英

Optimization of double subtraction in C++

I have the following code that I use to compute the distance between two vectors:

double dist(vector<double> & vecA, vector<double> & vecB){
    double curDist = 0.0;
    for (size_t i = 0; i < vecA.size(); i++){
        double dif = vecA[i] - vecB[i];
        curDist += dif * dif;
    }

    return curDist;
}

This function is a major bottleneck in my application since it relies on a lot of distance calculations, consuming more than 60% of CPU time on a typical input. Additionally, the following line:

double dif = vecA[i] - vecB[i];

is responsible for more than 77% of CPU time in this function. My question is: is it possible to somehow optimize this function?

Notes:

  • To profile my application I have used Intel Amplifier XE;
  • Reducing the number of distance computations is not a feasible solution for me;

There are two possible issues I can think of right now:

  • This computation is memory bound.
  • There is an iteration-to-iteration dependency on curDist .

This computation is memory bound.

Your dataset is larger than your CPU cache. So in this case, no amount of optimization is going to help unless you can restructure your algorithm.


There is an iteration-to-iteration dependency on curDist .

You have a dependency on curDist . This will block vectorization by the compiler. (Also, don't always trust the profiler numbers to the line. They can be inaccurate especially after compiler optimizations.)

Normally, the compiler vectorizer can split up the curDist into multiple partial sums to and unroll/vectorize the loop. But it can't do that under strict-floating-point behavior. You can try relaxing your floating-point mode if you haven't already. Or you can split the sum and unroll it yourself.

For example, this kind of optimization is something the compiler can do with integers, but not necessarily with floating-point :

double curDist0 = 0.0;
double curDist1 = 0.0;
double curDist2 = 0.0;
double curDist3 = 0.0;
for (size_t i = 0; i < vecA.size() - 3; i += 4){
    double dif0 = vecA[i + 0] - vecB[i + 0];
    double dif1 = vecA[i + 1] - vecB[i + 1];
    double dif2 = vecA[i + 2] - vecB[i + 2];
    double dif3 = vecA[i + 3] - vecB[i + 3];
    curDist0 += dif0 * dif0;
    curDist1 += dif1 * dif1;
    curDist2 += dif2 * dif2;
    curDist3 += dif3 * dif3;
}

//  Do some sort of cleanup in case (vecA.size() % 4 != 0)

double curDist = curDist0 + curDist1 + curDist2 + curDist3;

You could eliminate the call to vecA.size() for each iteration of the loop, just call it once before the loop. You could also do loop unrolling to give yourself more computation per loop iteration. What compiler are you using, and what optimization settings? Compiler will often do unrolling for you, but you could manually do it.

If it's feasible (if the range of the numbers isn't huge) you may want to explore using fixed point to store these numbers, rather than doubles.

Fixed point would turn these into int operations rather than double operations.

Another interesting thing is that assuming your profile is correct, the lookups seems to be a significant factor (otherwise the multiplication would likely be more costly than the subtractions).

I'd try using a const vector iterator rather than the random access lookup. It may help in two ways: 1 - it is constant, and 2 - the serial nature of the iterator may let the processor do better caching.

If your platform does not have (or is not using) an ALU that supports floating point math, floating point libraries, by nature, are slow and consume additional non-volatile memory. I suggest instead using 32-bit ( long ) or 64-bit ( long long ) fixed-point arithmetic. Then convert the final result to floating point at the end of the algorithm. I did this on a project a couple years ago to improve the performance of an I2T algorithm and it worked wonderfully.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM