
MATLAB matrix multiplication performance is 5x faster than NumPy

I set up two identical tests in MATLAB and Python for matrix multiplication with broadcasting. For Python I used NumPy; for MATLAB I used the mtimesx library, which uses BLAS.

MATLAB

close all; clear;

N = 1000 + 100; % a few initial runs to be trimmed off at the end

a = 100;
b = 30;
c = 40;
d = 50;
A = rand(b, c, a);
B = rand(c, d, a);
C = zeros(b, d, a);

times = zeros(1, N);
for ii = 1:N
    tic
    C = mtimesx(A,B);
    times(ii) = toc;
end

times = times(101:end) * 1e3;

plot(times);
grid on;
title(median(times));

Python

import timeit
import numpy as np
import matplotlib.pyplot as plt


N = 1000 + 100  # a few initial runs to be trimmed off at the end

a = 100
b = 30
c = 40
d = 50
A = np.arange(a * b * c).reshape([a, b, c])
B = np.arange(a * c * d).reshape([a, c, d])
C = np.empty(a * b * d).reshape([a, b, d])

times = np.empty(N)

for i in range(N):
    start = timeit.default_timer()
    C = A @ B
    times[i] = timeit.default_timer() - start

times = times[100:] * 1e3  # trim the first 100 warm-up runs, matching the MATLAB code

plt.plot(times, linewidth=0.5)
plt.grid()
plt.title(np.median(times))
plt.show()
  • My Python is the default Python installed from pip, using OpenBLAS.
  • I am running on an Intel NUC i3.

The MATLAB code runs in 1 ms, while the Python code runs in 5.8 ms, and I can't figure out why, since they both appear to be using BLAS.


Edit

From Anaconda:

In [7]: np.__config__.show()
mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = [...]
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = [...]

From numpy installed with pip:

In [2]: np.__config__.show()
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    library_dirs = [...]
    libraries = ['openblas']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]

Edit 2: I tried replacing C = A @ B with np.matmul(A, B, out=C) and got times 2x worse, around 11 ms. This is really strange.

Your MATLAB code uses floating point arrays, but the NumPy code uses integer arrays. This makes a significant difference in the timings. For an "apples to apples" comparison between MATLAB and NumPy, the Python/NumPy code must also use floating point arrays.
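A minimal sketch (my own, not from the original post) of making the NumPy side match MATLAB's rand, which produces double-precision floats; the dimension names follow the question's code:

```python
import numpy as np

a, b, c, d = 100, 30, 40, 50

# np.random.rand produces float64, the same type as MATLAB's rand
A = np.random.rand(a, b, c)
B = np.random.rand(a, c, d)

# alternatively, keep the arange-based data but cast it to float:
A_float = np.arange(a * b * c, dtype=np.float64).reshape([a, b, c])
```

Either approach gives float64 operands, so the timings exercise the same BLAS code path as the MATLAB test.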

That is not the only significant issue, however. There is a deficiency in NumPy, discussed in issue 7569 (and again in issue 8957) on the NumPy github site: matrix multiplication of "stacked" arrays does not use the fast BLAS routines to perform the multiplications. This means that multiplication of arrays with more than two dimensions can be much slower than expected.

Multiplication of 2-d arrays does use the fast routines, so you can work around this issue by multiplying the individual 2-d arrays in a loop. Surprisingly, despite the overhead of a Python loop, in many cases this is faster than @, matmul or einsum applied to the full stacked arrays.

Here is a variation of a function shown in the NumPy issue that performs the matrix multiplications in a Python loop:

def xmul(A, B):
    """
    Multiply stacked matrices A (with shape (s, m, n)) by stacked
    matrices B (with shape (s, n, p)) to produce an array with
    shape (s, m, p).

    Mathematically equivalent to A @ B, but faster in many cases.

    The arguments are not validated.  The code assumes that A and B
    are numpy arrays with the same data type and with shapes described
    above.
    """
    out = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
    for j in range(A.shape[0]):
        np.matmul(A[j], B[j], out=out[j])
    return out

My NumPy installation also uses MKL (it is part of the Anaconda distribution). Here is a timing comparison of A @ B and xmul(A, B), using arrays of floating point values:

In [204]: A = np.random.rand(100, 30, 40)

In [205]: B = np.random.rand(100, 40, 50)

In [206]: %timeit A @ B
4.76 ms ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [207]: %timeit xmul(A, B)
582 µs ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even though xmul uses a Python loop, it takes only about 1/8 the time of A @ B.
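As a quick sanity check (my own addition, not from the answer) that the looped version really is mathematically equivalent to the stacked @ operator, the two results can be compared on random float arrays:

```python
import numpy as np

def xmul(A, B):
    # loop over the stack, multiplying 2-d slices via the fast BLAS path
    out = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
    for j in range(A.shape[0]):
        np.matmul(A[j], B[j], out=out[j])
    return out

A = np.random.rand(100, 30, 40)
B = np.random.rand(100, 40, 50)

# the two methods agree to floating point tolerance
assert np.allclose(xmul(A, B), A @ B)
```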

I think this is an issue of memory ordering. MATLAB's zeros(a, b, c) is like NumPy's zeros((a, b, c), order='F'), which is not the default.

Of course, as you've already identified, @ operates on different axes than mtimesx. To make the comparison fair, you should ensure your arrays are in MATLAB order, then transpose to deal with the semantics difference:

# note: `order` in reshape actually changes the resulting array data,
# not just its memory layout
A = np.arange(a * b * c).reshape([b, c, a], order='F').transpose((2, 0, 1))
B = np.arange(a * c * d).reshape([c, d, a], order='F').transpose((2, 0, 1))
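A small check (my own, not from the answer) of what this layout produces: after the Fortran-order reshape and the transpose, each 2-d slice of the stack is Fortran-contiguous, i.e. laid out in memory like a MATLAB matrix, which BLAS can consume directly:

```python
import numpy as np

a, b, c = 100, 30, 40
A = np.arange(a * b * c).reshape([b, c, a], order='F').transpose((2, 0, 1))

# the transpose restores the (a, b, c) stacking used in the question
print(A.shape)                      # (100, 30, 40)

# the stacked array as a whole is not contiguous, but each 2-d slice is
# Fortran-contiguous
print(A[0].flags['F_CONTIGUOUS'])   # True
```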

Could you try it again with the recently released NumPy 1.16? We refactored matmul to use BLAS for the inner two dimensions, which should speed up the code.
