
Python matrix multiplication with numpy.dot()

While getting acquainted with CUDA in Python (the numba library), I implemented several matrix multiplication methods:

  • Plain numpy.dot()
  • The Strassen algorithm with numpy.dot()
  • A block method on the GPU
  • The Strassen algorithm on the GPU

I tested them on two types of data:

  • numpy.random.randint(0, 5, (N, N)) # with int32 elements
  • numpy.random.random((N, N)) # with float64 elements

For int32 I obtained the expected result: my GPU algorithms performed better than the CPU version using numpy.

[benchmark plot: int32 timings]

However, on the float64 type, numpy.dot() outperformed all my GPU methods.

[benchmark plot: float64 timings]

So, the question is: why is numpy.dot() so fast with float64 arrays, and does numpy use the GPU?

A typical installation of numpy will be dynamically linked against a BLAS library, which provides routines for matrix-matrix and matrix-vector multiplication. For example, when you call np.dot() on a pair of float64 arrays, numpy will call the BLAS dgemm routine in the background. Although these library functions run on the CPU rather than the GPU, they are often multithreaded and very finely tuned for performance. A good BLAS implementation, such as MKL or OpenBLAS, will probably be hard to beat in terms of performance, even on the GPU*.
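This is easy to check from Python. A minimal sketch (pure numpy, no GPU needed) that inspects which BLAS the build is linked against and confirms that a float64 dot stays in float64:

```python
import numpy as np

# Show which BLAS this numpy build is linked against
# (output varies by installation: OpenBLAS, MKL, Apple Accelerate, ...).
np.show_config()

rng = np.random.default_rng(0)
a = rng.random((256, 256))  # float64 by default
b = rng.random((256, 256))

# For float64 inputs, np.dot dispatches to the BLAS dgemm routine;
# the result agrees with the @ (matmul) operator.
c = np.dot(a, b)
print(c.dtype)                 # float64
print(np.allclose(c, a @ b))   # True
```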

However, BLAS only supports floating-point types. If you call np.dot() on integer arrays, numpy will fall back on a very simple internal C++ implementation, which is single-threaded and much slower than a BLAS dot on two floating-point arrays.
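A rough way to see this fallback, assuming a BLAS-backed numpy install: time the same-sized product for int32 versus float64 inputs. On most machines the float64 call wins by a wide margin:

```python
import time
import numpy as np

N = 500
rng = np.random.default_rng(0)
a_int = rng.integers(0, 5, (N, N), dtype=np.int32)
b_int = rng.integers(0, 5, (N, N), dtype=np.int32)
a_f = a_int.astype(np.float64)
b_f = b_int.astype(np.float64)

t0 = time.perf_counter()
np.dot(a_int, b_int)         # single-threaded integer fallback
t_int = time.perf_counter() - t0

t0 = time.perf_counter()
np.dot(a_f, b_f)             # BLAS dgemm
t_float = time.perf_counter() - t0

print(f"int32:   {t_int:.4f} s")
print(f"float64: {t_float:.4f} s")
```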

Without knowing more about how you conducted those benchmarks, I would bet that a plain call to numpy.dot would also comfortably beat your other three methods for float32, complex64 and complex128 arrays, which are the other three types supported by BLAS.
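The four BLAS-supported element types are easy to enumerate. A quick sketch checking that np.dot keeps each of them in its native precision, so each stays on the corresponding fast gemm path:

```python
import numpy as np

rng = np.random.default_rng(0)

# The four element types BLAS gemm supports:
# float32 -> sgemm, float64 -> dgemm,
# complex64 -> cgemm, complex128 -> zgemm.
for dtype in (np.float32, np.float64, np.complex64, np.complex128):
    a = rng.random((64, 64)).astype(dtype)
    b = rng.random((64, 64)).astype(dtype)
    c = np.dot(a, b)
    assert c.dtype == dtype   # no silent upcasting
    print(dtype.__name__, "->", c.dtype)
```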


* One possible way to beat standard BLAS would be to use cuBLAS, which is a BLAS implementation that runs on an NVIDIA GPU. The scikit-cuda library seems to provide Python bindings for it, although I've never used it myself.

I understand that numpy will automatically use multiple CPU cores for some functions, where it has been compiled against the right libraries (I think dot() was one of them, though I can't find the reference now). I suspect this is what's happening. I'm not aware of any attempt to provide a GPU back end for numpy: http://www.reddit.com/r/Python/comments/1mw9mb/is_there_a_gpu_backend_for_numpyscipy_money_is_no/
