
How to quickly compute the normalized L1 and L2 norms of a vector in C++?

I have a matrix X holding n column vectors in d-dimensional space. For a given column xj, v[j] is its L1 norm (the sum of all abs(xji)), w[j] is the square of its L2 norm (the sum of all xji^2), and pj[i] is a combination of the entries divided by the L1 and L2 norms. Finally, I need the outputs pj, v, and w for subsequent processing.

// X = new double[d*n]; is the input (column-major: column j starts at X + j*d).
// Requires <cmath> for std::abs.
double alpha = 0.5;
double *pj = new double[d];
double *x_abs = new double[d];
double *x_2 = new double[d];
double *v = new double[n]();
double *w = new double[n]();
for (unsigned long j=0; j<n; ++j) {
        unsigned long jd = j*d;
        for (unsigned long i=0; i<d; ++i) {
            x_abs[i] = std::abs(X[i+jd]);
            v[j] += x_abs[i];
            x_2[i] = x_abs[i]*x_abs[i];
            w[j] += x_2[i];    
        }
        for (unsigned long i=0; i<d; ++i){
            pj[i] = alpha*x_abs[i]/v[j]+(1-alpha)*x_2[i]/w[j];     
        }

   // functionA(pj){ ... ...}  for subsequent applications
} 
// functionB(v, w){ ... ...} for subsequent applications

My algorithm above takes O(nd) flops. Can anyone help me speed it up with built-in functions or a different implementation in C++? Reducing the constant factor hidden in the O(nd) would also be very helpful.

Let me guess: since you are having performance problems, the dimension of your vectors is quite large.
If that is the case, then it is worth considering CPU cache locality; there is some interesting material on this in a CppCon 2014 presentation.
If the data is not available in the CPU caches, the cost of abs-ing or squaring it once it arrives is dwarfed by the time the CPU spends just waiting for the data.

With this in mind, you may want to try the following solution (with no warranty that it will improve performance; the compiler may already apply these techniques when optimizing the code):

for (unsigned long j=0; j<n; ++j) {
        // use pointer arithmetic - at > -O0 the compiler will do it anyway
        double *start=X+j*d, *end=X+(j+1)*d;

        // This part avoids, as much as possible, competition
        // on the CPU caches between X and v/w.
        // Don't store the norms into v/w yet; keep them in registers.
        double l1norm=0, l2norm=0;
        for(const double *src=start; src!=end; src++) {
            double val=*src;
            l1norm += std::abs(val);   // std::abs from <cmath>
            l2norm += val*val;
        }
        double pl1=alpha/l1norm, pl2=(1-alpha)/l2norm;
        for(double *src=start, *dst=pj; src!=end; src++, dst++) {
          // Yes, recomputing abs/sqr may actually save time by not
          // creating competition on CPU caches with x_abs and x_2
          double val=*src;
          *dst = pl1*std::abs(val) + pl2*val*val;
        }    
        // functionA(pj){ ... ...}  for subsequent applications

        // Think carefully about whether you really need v/w. If you do,
        // at least there are only two values to be sent to memory here;
        // meanwhile the CPU can already load the next vector into cache.
        v[j]=l1norm; w[j]=l2norm;
}
// functionB(v, w){ ... ...} for subsequent applications
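To make the fused-loop idea above concrete, here is a self-contained, compilable sketch. Note one deliberate difference from the question's code: this version writes the normalized entries of every column into a single d*n output buffer instead of reusing a d-length `pj` buffer per column; the function name and layout are illustrative, not prescribed by the original post.

```cpp
#include <cmath>

// Single pass per column accumulates both norms in registers,
// then a second sweep over the same (now cached) column writes
// pj[i] = alpha*|x_i|/l1 + (1-alpha)*x_i^2/l2.
void normalize_columns(const double *X, unsigned long d, unsigned long n,
                       double alpha, double *pj, double *v, double *w) {
    for (unsigned long j = 0; j < n; ++j) {
        const double *start = X + j * d, *end = start + d;
        double l1 = 0.0, l2 = 0.0;
        for (const double *src = start; src != end; ++src) {
            double val = *src;
            l1 += std::abs(val);
            l2 += val * val;
        }
        double pl1 = alpha / l1, pl2 = (1 - alpha) / l2;
        double *dst = pj + j * d;
        for (const double *src = start; src != end; ++src, ++dst) {
            double val = *src;
            *dst = pl1 * std::abs(val) + pl2 * val * val;
        }
        v[j] = l1;
        w[j] = l2;
    }
}
```

A quick sanity check on the math: for each column, the pj entries sum to alpha*(l1/l1) + (1-alpha)*(l2/l2) = 1, which is a convenient invariant to test against.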
