
Will matrix multiplication using for loops decrease performance?

Currently I'm working on a program that uses matrices. I came up with this nested loop to multiply two matrices:

// The matrices are 1-dimensional arrays
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
        for (int k = 0; k < 4; k++)
            result[i * 4 + j] += M1[i * 4 + k] * M2[k * 4 + j];

The loop works. My question is: will this loop be slower compared to writing it all out manually like this:

result[0] = M1[0]*M2[0] + M1[1]*M2[4] + M1[2]*M2[8] + M1[3]*M2[12];
result[1] = M1[0]*M2[1] + M1[1]*M2[5] + M1[2]*M2[9] + M1[3]*M2[13];
result[2] = ... etc.

Because in the nested loop the array indices are computed at run time, while in the written-out version they are constants.

Thanks.

As with so many things, "it depends", but in this instance I would expect the second, expanded form to perform just about the same. Any modern compiler will unroll appropriate loops for you and take care of it.

Two points perhaps worth making:

  1. The second approach is uglier, more prone to errors, and tedious to write and maintain.

  2. This is a nice example of 'premature optimization' (AKA the root of all evil). Do you know whether this section is a bottleneck? Is it really the most intensive part of the code? By optimizing so early, we incur everything in point #1 for what amounts to a hunch if we haven't benchmarked our code.

Your compiler might already do this; take a look at loop unrolling. Let the compiler do the guessing and the heavy work, stick to clean code, and, as always, measure your performance.
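To illustrate what loop unrolling produces, here is a hand-unrolled sketch of the 4x4 multiply from the question (the function name is mine, not from the original): the innermost k-loop is replaced by its four multiply-adds, which is roughly the transformation an optimizing compiler performs on its own.

```cpp
#include <cstddef>

// 4x4 multiply with the innermost k-loop unrolled by hand, mirroring
// what an optimizing compiler typically does automatically. All three
// matrices are row-major 1-dimensional arrays of 16 elements.
void mul4x4_unrolled(const float* M1, const float* M2, float* result) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            // k-loop fully unrolled: four multiply-adds, no loop overhead,
            // and no need to zero result[] beforehand.
            result[i * 4 + j] = M1[i * 4 + 0] * M2[0 * 4 + j]
                              + M1[i * 4 + 1] * M2[1 * 4 + j]
                              + M1[i * 4 + 2] * M2[2 * 4 + j]
                              + M1[i * 4 + 3] * M2[3 * 4 + j];
        }
    }
}
```

With GCC or Clang, `-O2`/`-O3` will usually generate comparable code from the original triple loop, which is why the manual version rarely pays off.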

I don't think the loop will be slower. You are accessing the memory of the M1 and M2 arrays in the same way in both cases. If you want to make the "manual" version faster, use scalar replacement and do the computation on registers, e.g.:

 double M1_0 = M1[0];
 double M2_0 = M2[0];
 result[0] = M1_0*M2_0 + ...

but you can use scalar replacement within the loop as well. You can do this if you apply blocking and loop unrolling (in fact, your triple loop looks like a blocked version of matrix-matrix multiplication, MMM).
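A minimal sketch of scalar replacement applied inside the loop (the function name is mine): the running sum lives in a local variable, which the compiler can keep in a register, instead of being read back from `result[]` on every iteration of k.

```cpp
#include <cstddef>

// Scalar replacement inside the loop: accumulate in a local variable
// (likely kept in a register) and store once per output element.
// All matrices are row-major 1-dimensional arrays of 16 elements.
void mul4x4_scalar(const double* M1, const double* M2, double* result) {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            double sum = 0.0;                    // register accumulator
            for (int k = 0; k < 4; k++)
                sum += M1[i * 4 + k] * M2[k * 4 + j];
            result[i * 4 + j] = sum;             // single store per element
        }
    }
}
```

This also removes the requirement that `result` be zeroed before the multiply.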

What you are trying to do is speed up the program by improving locality, i.e. making better use of the memory hierarchy.

Assuming you are running your code on Intel or compatible (AMD) processors, you may actually want to switch to assembly language for heavy matrix computations. Luckily, the Intel IPP library does the actual work for you, using advanced processor features and selecting what is expected to be the fastest algorithm for your processor.

IPP includes all the matrix computations you could possibly need. The only problem you may encounter is the order in which you created your matrices: you may have to reorganize their memory layout to make the IPP functions you want easier to use.

Note that, in regard to your two code examples, the second one will be faster because it avoids the += operator, which is a read/modify/write cycle and generally slow (not only that, it requires the result matrix to start out all zeroes, whereas the second example does not require clearing the output first).

Your matrices are likely to fit in the cache, but processors are also optimized to read input data in sequence (a[0], a[1], a[2], a[3], ...) and to write it back in sequence. If you can write your algorithm to be as close as possible to such a sequence, all the better. Don't get me wrong, I know that matrix multiplications cannot be done entirely in sequence. But if you keep that in mind while optimizing, you'll achieve better results (e.g. changing the order in which your matrices are stored in memory could be one such change).
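The layout change hinted at above can be sketched as follows (a hypothetical helper, not from the original): storing the second operand transposed makes the inner loop read both arrays sequentially, which is the access pattern processors prefer.

```cpp
#include <cstddef>

// Sketch: transpose M2 into a scratch array so that the inner k-loop
// walks both operands sequentially in memory (M1 row-wise, M2T row-wise).
// All matrices are row-major 1-dimensional arrays of 16 elements.
void mul4x4_transposed(const double* M1, const double* M2, double* result) {
    double M2T[16];                              // column j of M2 becomes row j
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            M2T[c * 4 + r] = M2[r * 4 + c];

    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            double sum = 0.0;
            for (int k = 0; k < 4; k++)          // both reads are sequential
                sum += M1[i * 4 + k] * M2T[j * 4 + k];
            result[i * 4 + j] = sum;
        }
    }
}
```

For a 4x4 matrix the transpose overhead likely outweighs the gain, but the same idea pays off for large matrices, or when the second matrix can simply be stored transposed from the start.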
