Efficient numpy row-wise matrix multiplication using 3d arrays

I have two 3d arrays of shape (N, M, D) and I want to perform an efficient row-wise (over N) matrix multiplication such that the resulting array is of shape (N, D, D).

An inefficient code sample showing what I am trying to achieve is given by:

import numpy as np

N = 100
M = 10
D = 50
arr1 = np.random.normal(size=(N, M, D))
arr2 = np.random.normal(size=(N, M, D))

# Row-wise product: for each i, multiply arr1[i].T (D, M) by arr2[i] (M, D).
result = []
for i in range(N):
    result.append(arr1[i].T @ arr2[i])
result = np.array(result)

However, this approach is quite slow for large N due to the loop. Is there a more efficient way to perform this computation without using loops? I have already tried to find a solution via tensordot and einsum, to no avail.

The vectorized solution is to swap the last two axes of arr1:

>>> N, M, D = 2, 3, 4
>>> np.random.seed(0)
>>> arr1 = np.random.normal(size=(N, M, D))
>>> arr2 = np.random.normal(size=(N, M, D))
>>> arr1.transpose(0, 2, 1) @ arr2
array([[[ 6.95815626,  0.38299107,  0.40600482,  0.35990016],
        [-0.95421604, -2.83125879, -0.2759683 , -0.38027618],
        [ 3.54989101, -0.31274318,  0.14188485,  0.19860495],
        [ 3.56319723, -6.36209602, -0.42687188, -0.24932248]],

       [[ 0.67081341, -0.08816343,  0.35430089,  0.69962394],
        [ 0.0316968 ,  0.15129449, -0.51592291,  0.07118177],
        [-0.22274906, -0.28955683, -1.78905988,  1.1486345 ],
        [ 1.68432706,  1.93915798,  2.25785798, -2.34404577]]])
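
For reference, the same computation can also be written with np.einsum (which the question mentions trying). A minimal sketch, reusing the arr1 and arr2 defined above and checking both forms against the explicit loop:

result_matmul = arr1.transpose(0, 2, 1) @ arr2          # batched matmul over the first axis
result_einsum = np.einsum('nmd,nme->nde', arr1, arr2)    # sum over the shared M axis
result_loop = np.array([arr1[i].T @ arr2[i] for i in range(arr1.shape[0])])
assert np.allclose(result_matmul, result_einsum)
assert np.allclose(result_matmul, result_loop)

Both produce the same (N, D, D) result; which one is faster can depend on the NumPy build, so it is worth timing both on your own data.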

A simple benchmark with a larger N:

In [225]: arr1.shape
Out[225]: (100000, 10, 50)

In [226]: %%timeit
     ...: result = []
     ...: for i in range(N):
     ...:     result.append(arr1[i].T @ arr2[i])
     ...: result = np.array(result)
     ...:
     ...:
12.4 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [227]: %timeit arr1.transpose(0, 2, 1) @ arr2
906 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Out of interest, I wondered about the loop overhead, which I guess is minimal compared to the matrix multiplication; but more relevant than the loop overhead is the potential reallocation of the list memory, which with N = 10000 could be significant.

Using a pre-allocated array instead of a list, I compared the loop result and the solution provided by Mechanic Pig, and obtained the following results on my machine:

In [10]: %timeit result1 = arr1.transpose(0, 2, 1) @ arr2
33.7 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

versus

In [14]: %%timeit
    ...: result = np.empty_like(arr1, shape=(N, D, D))
    ...: for i in range(N):
    ...:     result[i, ...] = arr1[i].T @ arr2[i]
    ...:
    ...:
48.5 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The pure NumPy solution is still faster, so that's good, but only by a factor of about 1.5. Not too bad. Depending on the needs, the loop may be clearer about what it intends (and easier to modify, in case there's a need for an if-statement or other shenanigans). And naturally, a simple comment above the faster solution can easily point out what it actually replaces.
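
For example, a sketch of what such a comment could look like:

# Row-wise product: equivalent to stacking arr1[i].T @ arr2[i] for each i.
result = arr1.transpose(0, 2, 1) @ arr2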


Following the comments to this answer by Mechanic Pig, I've added below the timing results of a loop without a preallocated array (but with a preallocated list) and without conversion to a NumPy array, mainly so the results can be compared on the same machine:

In [11]: %%timeit
    ...: result = [None] * N
    ...: for i in range(N):
    ...:     result[i] = arr1[i].T @ arr2[i]
    ...:
49.5 ms ± 672 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So, interestingly, this result, without the conversion, is (a tiny bit) slower than the one with a pre-allocated array and direct assignment into the array.
