简体   繁体   中英

Matlab matrix multiplication speed

I was wondering how can matlab multiply two matrices so fast. When multiplying two NxN matrices, N^3 multiplications are performed. Even with the Strassen Algorithm it takes N^2.8 multiplications, which is still a large number. I was running the following test program:

a = rand(2160);
b = rand(2160);
tic;a*b;toc

2160 was used because 2160^3=~10^10 ( a*b should be about 10^10 multiplications)

I got:

Elapsed time is 1.164289 seconds.

(I'm running on 2.4Ghz notebook and no threading occurs) which mean my computer made ~10^10 operation in a little more than 1 second.

How this could be??

It's a combination of several things:

  • Matlab does indeed multi-thread.
  • The core is heavily optimized with vector instructions.

Here's the numbers on my machine: Core i7 920 @ 3.5 GHz (4 cores)

>> a = rand(10000);
>> b = rand(10000);
>> tic;a*b;toc
Elapsed time is 52.624931 seconds.

Task Manager shows 4 cores of CPU usage.

Now for some math:

Number of multiplies = 10000^3 = 1,000,000,000,000 = 10^12

Max multiplies in 53 secs =
    (3.5 GHz) * (4 cores) * (2 mul/cycle via SSE) * (52.6 secs) = 1.47 * 10^12

So Matlab is achieving about 1 / 1.47 = 68% efficiency of the maximum possible CPU throughput.

I see nothing out of the ordinary.

To check whether you do or not use multi-threading in MATLAB use this command

maxNumCompThreads(n)

This sets the number of cores to use to n. Now I have a Core i7-2620M, which has a maximum frequency of 2.7GHz, but it also has a turbo mode with 3.4GHz . The CPU has two cores. Let's see:

A = rand(5000);
B = rand(5000);
maxNumCompThreads(1);
tic; C=A*B; toc
Elapsed time is 10.167093 seconds.

maxNumCompThreads(2);
tic; C=A*B; toc
Elapsed time is 5.864663 seconds.

So there is multi-threading.

Let's look at the single CPU results. A*B executes approximately 5000^3 multiplications and additions. So the performance of single-threaded code is

5000^3*2/10.8 = 23 GFLOP/s

Now the CPU. 3.4 GHz, and Sandy Bridge can do maximum 8 FLOPs per cycle with AVX:

3.4 [Ginstructions/second] * 8 [FLOPs/instruction] = 27.2 GFLOP/s peak performance

So single core performance is around 85% peak, which is to be expected for this problem.

You really need to look deeply into the capabilities of your CPU to get accurate performannce estimates.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM