How can numpy be so much faster than my Fortran routine?

Question

I have a 512^3 array representing a temperature distribution from a simulation (written in Fortran). The array is stored in a binary file that's about 1/2G in size. I need to know the minimum, maximum and mean of this array, and as I will soon need to understand Fortran code anyway, I decided to give it a go and came up with the following very easy routine.

  integer gridsize,unit,j
  real mini,maxi
  double precision mean

  gridsize=512
  unit=40
  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp
  mini=tmp
  maxi=tmp
  mean=tmp
  do j=2,gridsize**3
      read(unit=unit) tmp
      if(tmp>maxi)then
          maxi=tmp
      elseif(tmp<mini)then
          mini=tmp
      end if
      mean=mean+tmp
  end do
  mean=mean/gridsize**3
  close(unit=unit)

This takes about 25 seconds per file on the machine I use. That struck me as being rather long, so I went ahead and did the following in Python:

    import numpy

    mmap=numpy.memmap('T.out',dtype='float32',mode='r',offset=4,\
                                  shape=(512,512,512),order='F')
    mini=numpy.amin(mmap)
    maxi=numpy.amax(mmap)
    mean=numpy.mean(mmap)

Now, I expected this to be faster of course, but I was really blown away. It takes less than a second under identical conditions. The mean deviates from the one my Fortran routine finds (which I also ran with 128-bit floats, so I somehow trust it more), but only in the 7th significant digit or so.

How can numpy be so fast? I mean, you have to look at every entry of an array to find these values, right? Am I doing something very stupid in my Fortran routine for it to take so much longer?

EDIT:

To answer the questions in the comments:

Yes, I also ran the Fortran routine with 32-bit and 64-bit floats, but it had no impact on performance.
I used iso_fortran_env, which provides 128-bit floats.
Using 32-bit floats my mean is off quite a bit though, so precision is really an issue.
I ran both routines on different files in different order, so the caching should have been fair in the comparison, I guess?
I actually tried OpenMP, but to read from the file at different positions at the same time. Having read your comments and answers, this sounds really stupid now, and it made the routine take a lot longer as well. I might give it a try on the array operations, but maybe that won't even be necessary.
The files are actually 1/2G in size, that was a typo, thanks.
I will try the array implementation now.

EDIT 2:

I implemented what @Alexander Vogt and @casey suggested in their answers, and it is as fast as numpy, but now I have a precision problem, as @Luaan pointed out I might. Using a 32-bit float array, the mean computed by sum is 20% off. Doing

...
real,allocatable :: tmp (:,:,:)
double precision,allocatable :: tmp2(:,:,:)
...
tmp2=tmp
mean=sum(tmp2)/size(tmp)
...

solves the issue but increases computing time (not by very much, but noticeably). Is there a better way to get around this issue? I couldn't find a way to read singles from the file directly into doubles. And how does numpy avoid this?

Thanks for all the help so far.

Answer 1

Your Fortran implementation suffers from two major shortcomings:

You mix IO and computations (and read from the file entry by entry).
You don't use vector/matrix operations.

This implementation performs the same operation as yours and is faster by a factor of 20 on my machine:

program test
  integer gridsize,unit
  real mini,maxi,mean
  real, allocatable :: tmp (:,:,:)

  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)
  mean = sum(tmp)/gridsize**3
  print *, mini, maxi, mean

end program

The idea is to read the whole file into one array tmp in one go. Then, I can use the functions MAXVAL, MINVAL, and SUM on the array directly.
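The same two access patterns can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a benchmark: the file is a small temporary one rather than T.out, and the names write_sample, stats_elementwise and stats_bulk are made up for this example. The first function issues one read per value, as the loop in the question does; the second reads the file in one call and reduces the whole array afterwards.

```python
import array
import math
import os
import struct
import tempfile


def write_sample(path, n):
    """Write n float32 values (a simple rising ramp) as raw binary."""
    values = array.array("f", (20.0 + 0.001 * i for i in range(n)))
    with open(path, "wb") as f:
        values.tofile(f)


def stats_elementwise(path, n):
    """One read call per value, mirroring the loop in the question."""
    with open(path, "rb") as f:
        (first,) = struct.unpack("f", f.read(4))
        mini = maxi = total = first
        for _ in range(n - 1):
            (x,) = struct.unpack("f", f.read(4))
            if x > maxi:
                maxi = x
            elif x < mini:
                mini = x
            total += x  # Python floats are doubles, so this accumulates in float64
    return mini, maxi, total / n


def stats_bulk(path, n):
    """One read call for the whole file, then whole-array reductions."""
    values = array.array("f")
    with open(path, "rb") as f:
        values.fromfile(f, n)
    return min(values), max(values), math.fsum(values) / n


n = 100_000  # small sample size chosen for the illustration
fd, path = tempfile.mkstemp(suffix=".bin")
os.close(fd)
try:
    write_sample(path, n)
    slow = stats_elementwise(path, n)
    fast = stats_bulk(path, n)
finally:
    os.remove(path)

print("element-wise:", slow)
print("bulk:        ", fast)
```

Both functions return the same minimum and maximum and a mean that agrees to within rounding; only the number of read calls differs.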

For the accuracy issue: simply using double precision values and doing the conversion on the fly as

mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))

only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that only increased the required time at the default optimization level.

At -O3, the element-wise addition performs ~3% better than the array operation. The difference between double and single precision operations is less than 2% on my machine, on average (the individual runs deviate by far more).
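To see why the accumulator type matters, here is a minimal Python sketch using only the standard library; it does not reproduce the Fortran or numpy internals. It emulates float32 rounding with struct and sums one million copies of 0.1 in three ways: with a float32 accumulator, with a float64 accumulator, and with float32 pairwise summation. The helper names (f32, sum_single, sum_double, sum_pairwise) and the block size of 128 are assumptions made for this illustration. The numpy.sum documentation notes that numpy often uses pairwise summation, which may be part of why its float32 mean came out close; the sketch only imitates that idea.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_single(values):
    """Sequential sum where the accumulator is rounded to float32 each step."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc


def sum_double(values):
    """Sequential sum with a float64 accumulator."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def sum_pairwise(values, lo, hi, block=128):
    """Float32 pairwise summation: split recursively, then add the halves."""
    if hi - lo <= block:
        return sum_single(values[lo:hi])
    mid = (lo + hi) // 2
    left = sum_pairwise(values, lo, mid, block)
    right = sum_pairwise(values, mid, hi, block)
    return f32(left + right)


n = 1_000_000
x = f32(0.1)          # the float32 value closest to 0.1
values = [x] * n
exact = x * n         # exactly representable in float64 for this n

rel_single = abs(sum_single(values) - exact) / exact
rel_double = abs(sum_double(values) - exact) / exact
rel_pairwise = abs(sum_pairwise(values, 0, n) - exact) / exact

print("float32 accumulator, relative error:", rel_single)
print("float64 accumulator, relative error:", rel_double)
print("float32 pairwise,    relative error:", rel_pairwise)
```

On the numpy side, passing dtype=numpy.float64 to numpy.mean requests double precision accumulation for a float32 array.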

Here is a very fast implementation using LAPACK:

program test
  integer gridsize,unit, i, j
  real mini,maxi
  integer  :: t1, t2, rate
  real, allocatable :: tmp (:,:,:)
  real, allocatable :: work(:)
!  double precision :: mean
  real :: mean
  real :: slange

  call system_clock(count_rate=rate)
  call system_clock(t1)
  gridsize=512
  unit=40

  allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))

  open(unit=unit,file='T.out',status='old',access='stream',&
       form='unformatted',action='read')
  read(unit=unit) tmp

  close(unit=unit)

  mini = minval(tmp)
  maxi = maxval(tmp)

!  mean = sum(tmp)/gridsize**3
!  mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
  mean = 0.d0
  do j=1,gridsize
    do i=1,gridsize
      mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)
    enddo !i
  enddo !j
  mean = mean / gridsize**3

  print *, mini, maxi, mean
  call system_clock(t2)
  print *,real(t2-t1)/real(rate)

end program

This uses the single precision matrix 1-norm SLANGE on matrix columns. The run-time is even faster than the approach using single precision array functions, and it does not show the precision issue.
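One caveat worth spelling out: the 1-norm of a single column is the sum of the absolute values of its entries, so it equals the plain sum only when every entry is non-negative. That holds for temperatures in kelvin but not, for instance, in degrees Celsius. A minimal Python sketch, where one_norm_column and the sample values are hypothetical and chosen only for illustration:

```python
def one_norm_column(column):
    """1-norm of an n-by-1 matrix: the sum of absolute values of its entries."""
    return sum(abs(v) for v in column)


temps_kelvin = [273.15, 300.0, 310.5]   # all non-negative
temps_celsius = [-5.0, 0.0, 12.5]       # contains a negative entry

# With non-negative data the 1-norm and the plain sum coincide.
print(one_norm_column(temps_kelvin), sum(temps_kelvin))

# With a negative entry they differ, so the norm is no longer the sum.
print(one_norm_column(temps_celsius), sum(temps_celsius))
```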

Answer 2

The numpy version is faster because you wrote much more efficient code in Python (and much of the numpy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.

Look at your Python code. You load the entire array at once and then call functions that can operate on an array.

Look at your Fortran code. You read one value at a time and do some branching logic with it.

The majority of your discrepancy is the fragmented IO you have written in Fortran.

You can write the Fortran in just about the same way as you wrote the Python, and you'll find it runs much faster that way.

program test
  implicit none
  integer :: gridsize, unit
  real :: mini, maxi, mean
  real, allocatable :: array(:,:,:)

  gridsize=512
  allocate(array(gridsize,gridsize,gridsize))
  unit=40
  open(unit=unit, file='T.out', status='old', access='stream',&
       form='unformatted', action='read')
  read(unit) array    
  maxi = maxval(array)
  mini = minval(array)
  mean = sum(array)/size(array)
  close(unit)
end program test
