
Speeding up numpy.dot

I've got a numpy script that spends about 50% of its runtime in the following code:

s = numpy.dot(v1, v1)

where

v1 = v[1:]

and v is a 4000-element 1-D ndarray of float64 stored in contiguous memory (v.strides is (8,)).

Any suggestions for speeding this up?
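A quick experiment worth running before anything else (my sketch, not from the original post) is to time numpy.dot against the common pure-NumPy alternatives, since for a vector this small the call overhead can dominate:

```python
import timeit
import numpy as np

v = np.random.rand(4000)
v1 = v[1:]  # the slice from the question

# Time the original dot product against two common alternatives.
for stmt in ["np.dot(v1, v1)",
             "np.einsum('i,i->', v1, v1)",
             "(v1 * v1).sum()"]:
    t = timeit.timeit(stmt, globals=globals(), number=10000)
    print(f"{stmt:25s} {t * 100:.2f} us/call")
```

All three compute the same sum of squares; which is fastest depends on the NumPy build and BLAS in use, so the timings are the point, not the ranking.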

edit: This is on Intel hardware. Here is the output of my numpy.show_config():

atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    language = f77
    include_dirs = ['/usr/local/atlas-3.9.16/include']

blas_opt_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
    language = c
    include_dirs = ['/usr/local/atlas-3.9.16/include']

atlas_blas_threads_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    language = c
    include_dirs = ['/usr/local/atlas-3.9.16/include']

lapack_opt_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
    language = f77
    include_dirs = ['/usr/local/atlas-3.9.16/include']

lapack_mkl_info:
  NOT AVAILABLE

blas_mkl_info:
  NOT AVAILABLE

mkl_info:
  NOT AVAILABLE

Perhaps the culprit is copying of the arrays passed to dot.

As Sven said, the dot product relies on BLAS operations. These operations require arrays stored in contiguous C order. If both arrays passed to dot are C_CONTIGUOUS, you ought to see better performance.

Of course, if the two arrays passed to dot are indeed 1-D with shape (8,), then you should see both the C_CONTIGUOUS and F_CONTIGUOUS flags set to True; but if they are (1, 8), then you can see mixed order.

>>> w = NP.random.randint(0, 10, 100).reshape(100, 1)
>>> w.flags
   C_CONTIGUOUS : True
   F_CONTIGUOUS : False
   OWNDATA : False
   WRITEABLE : True
   ALIGNED : True
   UPDATEIFCOPY : False
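As a sanity check for the question's specific case (my sketch, not from the answer): a contiguous 1-D slice like v[1:] keeps both contiguity flags, so BLAS should not need an intermediate copy:

```python
import numpy as np

v = np.zeros(4000)
v1 = v[1:]  # the slice from the question

# A contiguous 1-D slice is both C- and Fortran-contiguous,
# so dot() can hand it to BLAS without copying.
print(v1.flags['C_CONTIGUOUS'])  # True
print(v1.flags['F_CONTIGUOUS'])  # True
print(v1.strides)                # (8,)
```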


An alternative: use _GEMM from BLAS, which is exposed through the module scipy.linalg.fblas. (The two arrays, A and B, are obviously in Fortran order because fblas is used.)

from scipy.linalg import fblas as FB
X = FB.dgemm(alpha=1., a=A, b=B, trans_b=True)
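In newer SciPy releases, scipy.linalg.fblas has been folded into scipy.linalg.blas; a minimal sketch of the same idea using the vector-vector routine ddot instead of dgemm (assuming a recent SciPy is installed):

```python
import numpy as np
from scipy.linalg import blas  # replaces the old scipy.linalg.fblas

v = np.random.rand(4000)
v1 = v[1:]

# ddot computes the double-precision dot product directly in BLAS.
s = blas.ddot(v1, v1)
print(np.isclose(s, np.dot(v1, v1)))  # True
```

For a 1-D dot product, ddot is the natural BLAS level-1 routine; dgemm is the level-3 matrix-matrix call and adds overhead for vectors.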

Your arrays are not very big, so ATLAS probably isn't doing much. What are your timings for the following Fortran program? Assuming ATLAS isn't doing much, this should give you a sense of how fast dot() could be if there were no Python overhead. With gfortran -O3 I get speeds of 5 +/- 0.5 us.

    program test

    real*8 :: x(4000), start, finish, s
    integer :: i, j
    integer,parameter :: jmax = 100000

    x(:) = 4.65
    s = 0.
    call cpu_time(start)
    do j=1,jmax
        s = s + dot_product(x, x)
    enddo
    call cpu_time(finish)
    print *, (finish-start)/jmax * 1.e6, s

    end program test
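For comparison against the Fortran numbers, a rough Python-side analogue of the same loop (my sketch) that reports the average per-call time for numpy.dot:

```python
import time
import numpy as np

x = np.full(4000, 4.65)
jmax = 100000

s = 0.0
start = time.perf_counter()
for _ in range(jmax):
    s += np.dot(x, x)
finish = time.perf_counter()

# Average time per dot() call in microseconds, plus the checksum,
# mirroring the Fortran program's output.
print((finish - start) / jmax * 1e6, s)
```

The gap between this number and the Fortran one is roughly the Python call overhead the answer is talking about.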

The only thing I can think of to accelerate this is to make sure your NumPy installation is compiled against an optimized BLAS library (like ATLAS). numpy.dot() is one of only a few NumPy functions that make use of BLAS.
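To verify which BLAS your build is linked against (a quick check, not from the original answer; the exact report format varies between NumPy versions):

```python
import numpy as np

# Prints the BLAS/LAPACK configuration NumPy was compiled against;
# look for 'mkl', 'openblas', or 'atlas' in the output.
np.show_config()
```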

numpy.dot will use multithreading if compiled correctly. Make sure that it actually does by watching top. I know of cases where people didn't get multithreading in numpy with ATLAS to work. Furthermore, it's worth trying a numpy version compiled against the Intel MKL libraries, which include BLAS routines that are supposed to be faster than ATLAS on Intel hardware. You could give Enthought's Python distribution a try; it contains all of this and is free for people with an edu email account.
