
Peculiar difference in MKL matrix multiplication performance between Fortran/Python/MATLAB

I've written a trivial benchmark comparing matrix multiplication performance in three languages: Fortran (using Intel Parallel Studio 2015, compiled with the ifort switches /O3 /Qopt-prefetch=2 /Qopt-matmul /Qmkl:parallel, which replaces MatMul calls with calls to the Intel MKL library), Python (using the current Anaconda version, including Anaconda Accelerate, which supplies NumPy 1.9.2 linked with the Intel MKL library) and MATLAB R2015a (which, again, does matrix multiplication using the Intel MKL library).

Since all three implementations use the same Intel MKL library for matrix multiplication, I would expect the results to be virtually identical, especially for matrices large enough that function call overhead becomes negligible. However, this is far from the case: while MATLAB and Python display virtually identical performance, Fortran beats both by a factor of 2-3x. I'd like to understand why.

Here is the code I've used for the Fortran version:

program MatMulTest

implicit none

integer, parameter :: N = 1024
integer :: i, j, cr, cm
real*8 :: t0, t1, rate
real*8 :: A(N,N), B(N,N), C(N,N)    

call random_seed()
call random_number(A)
call random_number(B)

! First initialize the system_clock
CALL system_clock(count_rate=cr)
CALL system_clock(count_max=cm)
rate = real(cr)
WRITE(*,*) "system_clock rate: ", rate

call cpu_time(t0)
do i = 1, 100, 1
    C=MatMul(A,B)                
end do
call cpu_time(t1)

write(unit=*, fmt="(a24,f10.5,a2)") "Average time spent: ", (t1-t0), "ms"
write(unit=*, fmt="(a24,f10.3)") "First element of C: ", C(1,1)

end program MatMulTest

Do note that if your system clock rate is not 10000 as in my case, you need to modify the timing calculation accordingly to yield milliseconds.

The Python code:

import time
import numpy as np

def main(N):
    A = np.random.rand(N,N)
    B = np.random.rand(N,N)
    for i in range(100):
        C = np.dot(A,B)
    print C[0,0]

if __name__ == "__main__":
    N = 1024
    t0 = time.clock()
    main(N)
    t1 = time.clock()
    print "Time elapsed: " + str((t1-t0)*10) + " ms"

And, finally, the MATLAB snippet:

N=1024;
A=rand(N,N); B=rand(N,N);
tic;
for i=1:100
     C=A*B;
end
t=toc;
disp(['Time elapsed: ', num2str(t*10), ' milliseconds'])

On my system, the results are as follows:

Fortran: 38.08 ms
Python: 104.29 ms
MATLAB: 97.36 ms

CPU use is indistinguishable in all three cases (a steady 47-49% on an i7-920D0 processor with HT enabled for the duration of the calculation). Furthermore, the relative performance stays roughly the same for arbitrary matrix sizes, except that for very small matrices (N<80 or so) it is useful to manually disable parallelization in Fortran.

Is there any established reason for the discrepancy here? Am I doing something wrong? I would expect that, at least for larger matrices, Fortran would have no meaningful advantage in this case.

You have two issues here:

  1. In Python, you time the random initialisation as well as the computation, which you don't in Fortran and MATLAB.
  2. In Fortran, you measure the CPU time, while in Python and MATLAB you measure the elapsed time. And since you noticed that the CPU usage is around 46%, this might just account for the difference.
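
To see why the two clocks in point 2 are not interchangeable, here is a minimal Python 3 sketch (the question uses Python 2, so this is an illustration, not a patch to the original script). The helper name `measure` is hypothetical. `time.perf_counter()` reports elapsed (wall-clock) time, while `time.process_time()` reports CPU time consumed by the process. The workload is just a sleep, chosen because it makes the gap obvious: the wall clock advances while almost no CPU time is used. With a multithreaded workload the gap can go the other way, since CPU time is typically accumulated over all threads.

```python
import time


def measure(func):
    """Run func once and return (wall_seconds, cpu_seconds)."""
    wall0 = time.perf_counter()   # elapsed (wall-clock) time
    cpu0 = time.process_time()    # CPU time used by this process
    func()
    cpu1 = time.process_time()
    wall1 = time.perf_counter()
    return wall1 - wall0, cpu1 - cpu0


# Sleeping consumes wall-clock time but almost no CPU time.
wall, cpu = measure(lambda: time.sleep(0.2))
print(f"sleeping: wall = {wall:.3f} s, cpu = {cpu:.3f} s")
```

Whichever clock you pick, the point is to use the same kind of clock in all three programs before comparing the numbers.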

Just fix these two things and retry. You might consider using date_and_time() rather than cpu_time() for that purpose.
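
On the Python side, both fixes might look roughly like the sketch below. It assumes Python 3 and NumPy, whereas the question's script is Python 2. The names `time_per_call` and `benchmark_matmul` are hypothetical, and the untimed warm-up call is an addition that is not part of the original benchmark. The matrices are created before the timer starts, and the timer is a wall clock (`time.perf_counter()`), so the result is comparable to MATLAB's tic/toc.

```python
import time


def time_per_call(func, repeats=100):
    """Return the mean wall-clock time of func() in milliseconds."""
    func()  # warm-up call, not timed (an assumption, not in the original)
    t0 = time.perf_counter()
    for _ in range(repeats):
        func()
    t1 = time.perf_counter()
    return (t1 - t0) / repeats * 1e3


def benchmark_matmul(n=1024, repeats=100):
    """Time only the multiplication; the random setup stays outside the timer."""
    import numpy as np  # imported here so time_per_call stays dependency-free

    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    ms = time_per_call(lambda: np.dot(a, b), repeats)
    print(f"Average time per multiplication: {ms:.2f} ms")
    return ms
```

Calling `benchmark_matmul(1024)` would then reproduce the question's setup with only the 100 multiplications inside the timed region.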
