Fortran matrix product slows down when called from Python with f2py
I've been trying to use f2py to interface optimized Fortran code for vector and matrix multiplication with Python. To obtain a performance comparison useful for my purposes, I perform the same product inside a loop 100000 times. With pure Fortran code the product takes 2.4 sec (ifort), while with f2py it takes approx. 11 sec. Just for reference, with NumPy it takes approx. 20 sec. I have both the Fortran and the Python parts print the time elapsed across the loop, and with f2py they both report 11 sec, so the code is not losing time in passing arrays. I tried to work out whether the cause is the way NumPy arrays are stored, but I can't identify the problem. Do you have any idea? Thanks in advance.
Fortran main program:
program Main
  implicit none
  save
  integer :: seed, i, j, k
  integer, parameter :: states = 15
  integer, parameter :: tessere = 400
  real, dimension(tessere,states,states) :: matrix
  real, dimension(states) :: vector
  real :: start, finish
  real :: prod(tessere)

  ! Fill the test matrix and vector with deterministic values
  do i = 1, tessere
    do j = 1, states
      do k = 1, states
        matrix(i,j,k) = i + j + k
      end do
    end do
  end do
  do i = 1, states
    vector(i) = i
  end do

  call doubleSum(vector, vector, matrix, states, tessere, prod)
end program
Fortran subroutine:
subroutine doubleSum(ket, bra, M, states, tessere, prod)
  implicit none
  integer :: its, j, k, t
  integer :: states
  integer :: tessere
  real, dimension(tessere,states,states) :: M
  real, dimension(states) :: ket
  real, dimension(states) :: bra
  real, dimension(tessere) :: prod
  real, dimension(tessere,states) :: ctmp
  real :: start, finish   ! previously implicitly typed

  call cpu_time(start)
  ! Repeat the same product 100000 times for timing
  do t = 1, 100000
    ctmp = 0.d0
    ! ctmp(its,k) = sum over j of M(its,k,j) * ket(j)
    do k = 1, states
      do j = 1, states
        do its = 1, tessere
          ctmp(its,k) = ctmp(its,k) + M(its,k,j)*ket(j)
        end do
      end do
    end do
    ! prod(its) = sum over k of bra(k) * ctmp(its,k)
    do its = 1, tessere
      prod(its) = dot_product(bra, ctmp(its,:))
    end do
  end do
  call cpu_time(finish)
  print '("Time = ",f6.3," seconds.")', finish - start
end subroutine
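For readers comparing against NumPy, the loop nest above amounts to the contraction prod(i) = sum over k and j of bra(k) * M(i,k,j) * ket(j). The following is only a sketch of that contraction for checking results. It is not the NumPy code that was timed in the question (which is not shown), and the function name double_sum_reference is made up for illustration.

```python
import numpy as np


def double_sum_reference(ket, bra, M):
    """Hypothetical NumPy reference for the doubleSum loop nest.

    Computes prod[i] = sum_k sum_j bra[k] * M[i, k, j] * ket[j],
    which is what the Fortran loops evaluate in each repetition.
    """
    return np.einsum("k,ikj,j->i", bra, M, ket)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.random((400, 15, 15))
    ket = rng.random(15)
    prod = double_sum_reference(ket, ket, M)
    print(prod.shape)
```

The argument order (ket, bra, M) mirrors the call cicloS.doublesum(ket, ket, M), so the two results can be compared with np.allclose, allowing for the single-precision arithmetic on the Fortran side.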
Python script:
import numpy as np
import time
import cicloS

M = np.random.rand(400, 15, 15)
ket = np.random.rand(15)
#M = np.asfortranarray(M)
#ket = np.asfortranarray(ket)

start = time.time()
prod = cicloS.doublesum(ket, ket, M)
end = time.time()
print(end - start)
.pyf file generated with f2py and edited:
!    -*- f90 -*-
! Note: the context of this file is case sensitive.
python module cicloS
    interface
        subroutine doublesum(ket,bra,m,states,tessere,prod)
            real dimension(states),intent(in) :: ket
            real dimension(states),depend(states),intent(in) :: bra
            real dimension(tessere,states,states),depend(states,states),intent(in) :: m
            integer, optional,check(len(ket)>=states),depend(ket) :: states=len(ket)
            integer, optional,check(shape(m,0)==tessere),depend(m) :: tessere=shape(m,0)
            real dimension(tessere),intent(out) :: prod
        end subroutine doublesum
    end interface
end python module cicloS
The OP has indicated that the observed difference in execution time between the standalone and F2PY-compiled versions of the code was due to different compilers and compiler flags being used.
To obtain consistent results, and thereby answer the question, it is necessary to ensure that F2PY uses the desired 1) compiler and 2) compiler flags.
A list of the Fortran compilers available to F2PY on the target system can be displayed by executing, e.g.:
python -m numpy.f2py -c --help-fcompiler
On my system, this produces (truncated):
Fortran compilers found:
--fcompiler=gnu95 GNU Fortran 95 compiler (7)
--fcompiler=intelem Intel Fortran Compiler for 64-bit apps (19.0.1.144)
You can tell F2PY which of the available Fortran compilers to use by adding the appropriate --fcompiler flag to your compile command. For example, to use ifort (assuming you have already created and edited a cicloS.pyf file):
python -m numpy.f2py --fcompiler=intelem -c cicloS.pyf sub.f90
Note that the output from --help-fcompiler in the previous step also displays the default compiler flags (see e.g. compiler_f90) that F2PY defines for each available compiler. Again on my system, these were (truncated and simplified to the most relevant flags):
-O3 -funroll-loops
-O3 -funroll-loops
-O3 -xSSE4.2 -axCORE-AVX2,COMMON-AVX512
-O3 -xSSE4.2 -axCORE-AVX2,COMMON-AVX512
You can then specify your preferred optimisation flags for F2PY with the --opt flag in your compile command (see also --f90flags in the documentation), which now becomes e.g.:
python -m numpy.f2py --fcompiler=intelem --opt='-O1' -c cicloS.pyf sub.f90
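When comparing several flag settings, it can help to assemble the compile command programmatically so that only the flags vary. The sketch below only builds the argument list from the options already shown here (--fcompiler, --opt, -c, and the -DF2PY_REPORT_ON_ARRAY_COPY=1 define). The helper name build_f2py_command is hypothetical, and it assumes an F2PY setup that accepts these flags as in the commands above, which may not hold for every NumPy build backend.

```python
import shlex
import sys


def build_f2py_command(signature, source, fcompiler=None, opt=None,
                       report_copies=False):
    """Hypothetical helper: return an F2PY compile command as a list.

    Only flags that appear in this answer are used; nothing is executed.
    """
    cmd = [sys.executable, "-m", "numpy.f2py"]
    if fcompiler:
        cmd.append(f"--fcompiler={fcompiler}")
    if opt:
        # No shell quoting is needed when the command is passed as a list.
        cmd.append(f"--opt={opt}")
    if report_copies:
        cmd.append("-DF2PY_REPORT_ON_ARRAY_COPY=1")
    cmd += ["-c", signature, source]
    return cmd


if __name__ == "__main__":
    for opt in ("-O1", "-O3"):
        cmd = build_f2py_command("cicloS.pyf", "sub.f90",
                                 fcompiler="intelem", opt=opt)
        print(shlex.join(cmd))
```

Passing the returned list to subprocess.run(cmd, check=True) would start the build. Each build produces the same cicloS module, so it has to be rebuilt (and Python restarted) between measurements.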
Compiling a standalone executable with ifort -O1 sub.f90 main.f90 -o main, and using the F2PY version compiled with --opt='-O1' as above, I get the following output:
./main
Time = 5.359 seconds.
python test.py
Time = 5.297 seconds.
5.316878795623779
Then, compiling a standalone executable with ifort -O3 sub.f90 main.f90 -o main, and using the F2PY version compiled with --fcompiler=intelem and its default flags, I get these results:
./main
Time = 1.297 seconds.
python test.py
Time = 1.219 seconds.
1.209657907485962
This shows similar timings for the standalone and F2PY versions, as well as the influence of the compiler flags.
Although not the cause of the slowdown you observe, do note that F2PY is forced to make temporary copies of the arrays M (and ket) in your Python example, for two reasons:
1. The 3D array M that you pass to cicloS.doublesum() is a default NumPy array, with C ordering (row-major). Since Fortran uses column-major ordering, F2PY will make array copies. Using np.asfortranarray() (currently commented out in your script) should correct this part of the problem.
2. The next reason for array copies (also for ket) is a mismatch between the real kinds on the Python side (default 64-bit, double-precision float) and the Fortran side (real gives the default precision, likely a 32-bit float) of your example. So copies are again made to account for this.
You can get a notification whenever array copies are made by adding a -DF2PY_REPORT_ON_ARRAY_COPY=1 flag (also in the documentation) to your F2PY compile command. In your case, array copies can be avoided completely by changing the dtype of your M and ket arrays in Python (i.e. M=np.asfortranarray(M, dtype=np.float32) and ket=np.asfortranarray(ket, dtype=np.float32)), or alternatively by defining the real variables in your Fortran code with the appropriate kind (e.g. add use, intrinsic :: iso_fortran_env, only : real64 to your subroutine and main program, and define the reals with real(kind=real64)).
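The two copy conditions above can be checked from Python before calling the wrapped routine. The following is a rough sketch that mirrors those two reasons (element type and memory order). The helper needs_copy_for_f2py is hypothetical and is not F2PY's actual decision logic, and it assumes the Fortran side uses default real, taken here to be 32-bit.

```python
import numpy as np


def needs_copy_for_f2py(arr, fortran_dtype=np.float32):
    """Heuristic check: would this array likely be copied on the way in?

    Assumes the Fortran dummy argument is default `real` (32-bit here).
    """
    arr = np.asarray(arr)
    wrong_dtype = arr.dtype != np.dtype(fortran_dtype)
    # Contiguous 1-D arrays count as both C- and Fortran-contiguous.
    wrong_order = not arr.flags.f_contiguous
    return wrong_dtype or wrong_order


if __name__ == "__main__":
    M = np.random.rand(400, 15, 15)   # float64, C order
    ket = np.random.rand(15)          # float64, 1-D
    print(needs_copy_for_f2py(M), needs_copy_for_f2py(ket))

    # Convert once, outside any timed region.
    M32 = np.asfortranarray(M, dtype=np.float32)
    ket32 = np.asfortranarray(ket, dtype=np.float32)
    print(needs_copy_for_f2py(M32), needs_copy_for_f2py(ket32))
```

If the Fortran code is changed to real(kind=real64) instead, the same check applies with fortran_dtype=np.float64.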