Fortran matrix product becomes slower when called from Python with f2py

Question

I've been trying to use f2py to interface optimized Fortran code for vector and matrix multiplication with Python. To get a performance comparison useful for my purposes, I perform the same product inside a loop 100000 times. With pure Fortran code the product takes 2.4 s (ifort), while with f2py it takes approximately 11 s. Just for reference, with NumPy it takes approximately 20 s. I have both the Fortran part and the Python part print the time difference before and after the loop, and with f2py they both report 11 s, so the code is not losing time in passing arrays. I tried to work out whether the cause is the way NumPy arrays are stored, but I can't pin down the problem. Do you have any idea? Thanks in advance.

Fortran main program:

program Main
    implicit none
    save

    integer :: seed, i, j, k
    integer, parameter :: states =15
    integer, parameter :: tessere = 400
    real, dimension(tessere,states,states) :: matrix
    real, dimension(states) :: vector
    real :: start, finish
    real  :: prod(tessere)

    do i=1,tessere
       do j=1,states
          do k=1,states
              matrix(i,j,k) = i+j+k
          end do
       enddo
    end do
    do i=1,states
        vector(i) = i
    enddo
    call doubleSum(vector,vector,matrix,states,tessere,prod)

end program

Fortran subroutine:

subroutine doubleSum(ket, bra, M , states, tessere,prod)
    integer :: its, j, k,t
    integer :: states
    integer :: tessere
    real, dimension(tessere,states,states) :: M
    real, dimension(states) :: ket
    real, dimension(states) :: bra
    real, dimension(tessere) :: prod
    real,dimension(tessere,states) :: ctmp

    call cpu_time(start)
    do t=1,100000
        ctmp=0.d0
        do k=1,states
             do j=1,states
                do its=1,tessere
                   ctmp(its,k)=ctmp(its,k)+ M(its,k,j)*ket(j)
                enddo
             enddo
        enddo
        do its=1,tessere
            prod(its)=dot_product(bra,ctmp(its,:))
        enddo
    enddo
    call cpu_time(finish)
    print '("Time = ",f6.3," seconds.")',finish-start
end subroutine
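
For cross-checking results, the quantity that doubleSum computes can be sketched in NumPy as below. This is an assumed equivalent written from the loop structure of the subroutine, not the NumPy code behind the 20 s timing mentioned above; the function name double_sum_reference is hypothetical.

```python
import numpy as np

def double_sum_reference(ket, bra, M):
    """Hypothetical NumPy equivalent of one pass of the Fortran doubleSum loop."""
    # ctmp[i, k] = sum_j M[i, k, j] * ket[j]   (shape: tessere x states)
    ctmp = M @ ket
    # prod[i] = sum_k bra[k] * ctmp[i, k]      (shape: tessere)
    return ctmp @ bra

# Same shapes as in the Python script of the question
M = np.random.rand(400, 15, 15)
ket = np.random.rand(15)

prod = double_sum_reference(ket, ket, M)

# The same contraction written as a single einsum, as an independent check
alt = np.einsum("k,ikj,j->i", ket, M, ket)
print(prod.shape, np.allclose(prod, alt))
```

Comparing this against the array returned by cicloS.doublesum should use a loose tolerance, since the Fortran side works in default (likely single-precision) real.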

Python script:

import numpy as np
import time
import cicloS


M= np.random.rand(400,15,15)
ket=np.random.rand(15)

#M=np.asfortranarray(M)
#ket=np.asfortranarray(ket)


start=time.time()  
prod=cicloS.doublesum(ket,ket,M)
end=time.time()
print(end-start)

.pyf file generated with f2py and edited:

!    -*- f90 -*-
! Note: the context of this file is case sensitive.

python module cicloS 
    interface  
        subroutine doublesum(ket,bra,m,states,tessere,prod) 
            real dimension(states),intent(in) :: ket
            real dimension(states),depend(states),intent(in) :: bra
            real dimension(tessere,states,states),depend(states,states),intent(in) :: m
            integer, optional,check(len(ket)>=states),depend(ket) :: states=len(ket)
            integer, optional,check(shape(m,0)==tessere),depend(m) :: tessere=shape(m,0)
            real dimension(tessere),intent(out) :: prod
        end subroutine doublesum
    end interface
end python module cicloS

Answer 1

The OP has indicated that the observed difference in execution time between the standalone and F2PY-compiled versions of the code was due to different compilers and compiler flags being used.

To obtain consistent results, and thereby answer the question, it is necessary to ensure that F2PY uses the desired 1) compiler and 2) compiler flags.

Part 1: Specify which Fortran compiler should be used by F2PY

A list of the Fortran compilers available to F2PY on the target system can be displayed by executing e.g. python -m numpy.f2py -c --help-fcompiler. On my system, this produces (truncated):

Fortran compilers found:
  --fcompiler=gnu95    GNU Fortran 95 compiler (7)
  --fcompiler=intelem  Intel Fortran Compiler for 64-bit apps (19.0.1.144)

You can tell F2PY which of the available Fortran compilers to use by adding an appropriate --fcompiler flag to your compile command. For example, to use ifort (assuming you have already created and edited a cicloS.pyf file):

python -m numpy.f2py --fcompiler=intelem -c cicloS.pyf sub.f90

Part 2: Specify additional compiler flags to be used by F2PY

Note that the output from --help-fcompiler in the previous step also displays the default compiler flags (see e.g. compiler_f90) that F2PY defines for each available compiler. Again on my system, this was (truncated and simplified to the most relevant flags):

gnu95: -O3 -funroll-loops
intelem: -O3 -xSSE4.2 -axCORE-AVX2,COMMON-AVX512

You can then specify your preferred optimisation flags for F2PY with the --opt flag in your compile command (see also --f90flags in the documentation), which now becomes e.g.:

python -m numpy.f2py --fcompiler=intelem --opt='-O1' -c cicloS.pyf sub.f90

Compare run time for standalone and F2PY versions:

Compiling a standalone executable with ifort -O1 sub.f90 main.f90 -o main, and using the F2PY-compiled version from Part 2, I get the following output:

./main
Time =  5.359 seconds.

python test.py
Time =  5.297 seconds.
5.316878795623779

Then, compiling a standalone executable with ifort -O3 sub.f90 main.f90 -o main, and using the (default) F2PY-compiled version from Part 1, I get these results:

./main
Time =  1.297 seconds.

python test.py
Time =  1.219 seconds.
1.209657907485962

The standalone and F2PY versions thus show similar run times once they are built with the same compiler and flags, and the two sets of results also show how strongly the compiler flags influence the timing.
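
One detail to keep in mind when comparing the two printed numbers: Fortran's cpu_time reports processor time, while time.time() in the Python script reports wall-clock time. A minimal sketch for measuring both on the Python side follows. The helper name time_call is hypothetical, and the workload is a stand-in so the snippet runs without the compiled module; replace it with cicloS.doublesum(ket, ket, M) once the extension is built.

```python
import time

def time_call(func, *args):
    """Return (result, wall_seconds, cpu_seconds) for a single call of func(*args)."""
    wall0 = time.perf_counter()   # monotonic wall-clock timer
    cpu0 = time.process_time()    # CPU time of this process, closest analogue of cpu_time
    result = func(*args)
    cpu1 = time.process_time()
    wall1 = time.perf_counter()
    return result, wall1 - wall0, cpu1 - cpu0

# Stand-in workload (assumption: any callable works here)
result, wall, cpu = time_call(sum, range(1_000_000))
print(f"wall = {wall:.3f} s, cpu = {cpu:.3f} s")
```

For a single-threaded run the two values should be close; a large gap would point to threading or to time spent waiting rather than computing.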

Comment on temporary arrays

Although this is not the cause of the slowdown you observe, do note that F2PY is forced to make temporary copies of the arrays M (and ket) in your Python example, for two reasons:

First, the 3D array M that you pass to cicloS.doublesum() is a default NumPy array, with C ordering (row-major). Since Fortran uses column-major ordering, F2PY will make a copy of the array. The commented-out np.asfortranarray() call should correct this part of the problem.
Second, there is a mismatch between the real kinds on the Python side (default 64-bit, double-precision float) and the Fortran side (real gives the default precision, likely a 32-bit float) of your example. This affects ket as well as M, so copies are again made to account for it.

You can get a notification whenever array copies are made by adding the -DF2PY_REPORT_ON_ARRAY_COPY=1 flag (also in the documentation) to your F2PY compile command. In your case, array copies can be avoided completely by changing the dtype of your M and ket arrays in Python (i.e. M=np.asfortranarray(M, dtype=np.float32) and ket=np.asfortranarray(ket, dtype=np.float32)), or alternatively by defining the real variables in your Fortran code with the appropriate kind (e.g. add use, intrinsic :: iso_fortran_env, only : real64 to your subroutine and main program, and define reals with real(kind=real64)). With the second option, M still needs np.asfortranarray() to avoid the copy caused by the ordering.
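
The two conditions above can be checked from Python before calling the wrapped routine. The following is a minimal sketch, assuming the Fortran side uses 32-bit reals as in your subroutine; the helper name copy_reasons is hypothetical and only inspects NumPy metadata, it does not query F2PY itself.

```python
import numpy as np

def copy_reasons(arr, fortran_dtype=np.float32):
    """List the reasons why F2PY would likely need a temporary copy of arr."""
    reasons = []
    # Fortran expects column-major (Fortran-contiguous) memory layout
    if not arr.flags["F_CONTIGUOUS"]:
        reasons.append("not Fortran-contiguous (column-major)")
    # The element type must match the Fortran real kind (assumed 32-bit here)
    if arr.dtype != np.dtype(fortran_dtype):
        reasons.append(f"dtype {arr.dtype} does not match {np.dtype(fortran_dtype)}")
    return reasons

M = np.random.rand(400, 15, 15)
ket = np.random.rand(15)
print(copy_reasons(M))    # layout and dtype both differ
print(copy_reasons(ket))  # dtype only: a contiguous 1-D array is C- and F-contiguous

# Convert once, outside any timed region
M32 = np.asfortranarray(M, dtype=np.float32)
ket32 = np.asfortranarray(ket, dtype=np.float32)
print(copy_reasons(M32), copy_reasons(ket32))
```

Note that the conversion to float32 rounds the input values, so results will differ slightly from a double-precision computation.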

Answer 1: 3, accepted, 2019-02-01 10:32:32
