Python matrix multiplication with numpy.dot()
During my acquaintance with CUDA in Python (numba lib), I implemented matrix multiplication methods:
- just using numpy.dot()
- using numpy.dot()
- the Strassen algorithm

So I tested them on 2 types of data:
numpy.random.randint(0, 5, (N, N)) # with int32 elements
numpy.random.random((N, N)) # with float64 elements
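A minimal way to reproduce the comparison on the CPU side (a sketch; the matrix size N and the simple one-shot timing are my choices, not from the original benchmark):

```python
import time

import numpy as np

N = 512
a_int = np.random.randint(0, 5, (N, N))  # integer elements
a_flt = np.random.random((N, N))         # float64 elements

def time_dot(x):
    """Time a single matrix-matrix product with np.dot."""
    t0 = time.perf_counter()
    x.dot(x)
    return time.perf_counter() - t0

t_int = time_dot(a_int)
t_flt = time_dot(a_flt)
print(f"integer dot: {t_int:.4f} s")
print(f"float64 dot: {t_flt:.4f} s")
```

On a typical numpy build the float64 product is dramatically faster despite the identical shapes, which is exactly the asymmetry the question is about.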
For int32 I obtained the expected result, where my GPU algorithms performed better than the CPU with numpy:
However, on the float64 type, numpy.dot() outperformed all my GPU methods:
So, the question is: why is numpy.dot() so fast with float64 arrays, and does numpy use the GPU?
A typical installation of numpy will be dynamically linked against a BLAS library, which provides routines for matrix-matrix and matrix-vector multiplication. For example, when you use np.dot() on a pair of float64 arrays, numpy will call the BLAS dgemm routine in the background. Although these library functions run on the CPU rather than the GPU, they are often multithreaded and very finely tuned for performance. A good BLAS implementation, such as MKL or OpenBLAS, will probably be hard to beat in terms of performance, even on the GPU*.
However, BLAS only supports floating-point types. If you call np.dot() on integer arrays, numpy will fall back on a very simple internal C++ implementation, which is single-threaded and much slower than a BLAS dot on two floating-point arrays.
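A practical consequence: casting integer arrays to float64 before the multiplication lets np.dot take the BLAS path, and for small integers the result is still exact (a sketch; safe here because every intermediate value fits exactly in a float64):

```python
import numpy as np

N = 256
a = np.random.randint(0, 5, (N, N))  # integer array: np.dot uses the slow C path

# Casting to float64 first lets np.dot dispatch to BLAS dgemm instead.
# With elements in [0, 5) the products (at most 4*4*256 = 4096) are
# represented exactly in float64, so no precision is lost.
c_slow = a.dot(a)
c_fast = a.astype(np.float64).dot(a.astype(np.float64))
assert np.array_equal(c_slow, c_fast.astype(a.dtype))
```

For large integers this trick can lose precision (float64 has a 53-bit mantissa), so it only applies when the values are known to be small.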
Without knowing more about how you conducted those benchmarks, I would bet that a plain call to numpy.dot would also comfortably beat your other 3 methods for float32, complex64 and complex128 arrays, which are the other 3 types supported by BLAS.
* One possible way to beat standard BLAS would be to use cuBLAS, which is a BLAS implementation that runs on an NVIDIA GPU. The scikit-cuda library seems to provide Python bindings for it, although I've never used it myself.
I understand that numpy will automatically use multiple CPU cores for some functions where it has been compiled against the right libraries (and I think dot() was one of them, though I can't find the reference now). I suspect this is what's happening. I'm not aware of any attempts to get a numpy GPU back end: http://www.reddit.com/r/Python/comments/1mw9mb/is_there_a_gpu_backend_for_numpyscipy_money_is_no/