
Numpy “vectorized” row-wise dot product running slower than a for loop

Given a matrix A with shape (n,k) and a vector s of size n, I want to compute a matrix G with shape (k,k) as follows:

G += s[i] * A[i].T * A[i], for all i in {0,...,n-1}

Here A[i] is treated as a (1,k) row vector, so A[i].T * A[i] is a (k,k) outer product.

I tried to implement that using a for loop (Method 1) and in a vectorized manner (Method 2), but the for loop implementation is faster for large values of k (especially when k > 500).

The code was written as follows:

import numpy as np
k = 200
n = 50000
A = np.random.randint(0, 1000, (n,k)) # generates random data for the matrix A (n,k)
G1 = np.zeros((k,k)) # initialize G1 as a (k,k) matrix
s = np.random.randint(0, 1000, n) * 1.0 # initialize a random vector of size n

# METHOD 1
for i in range(n):
    G1 += s[i] * np.dot(np.array([A[i]]).T, np.array([A[i]]))

# METHOD 2
G2 = np.dot(A[:,np.newaxis].T, s[:,np.newaxis]*A)
G2 = np.squeeze(G2) # reduces dimension from (k,1,k) to (k,k)

The matrices G1 and G2 are the same (they are both the matrix G); the only difference is how they were computed. Is there a cleverer and more efficient way to compute this?

Finally, these are the times I got with random sizes for k and n:

Test #: 1
k,n: (866, 45761)
Method1: 337.457569838s
Method2: 386.290487051s
--------------------
Test #: 2
k,n: (690, 48011)
Method1: 152.999140978s
Method2: 226.080267191s
--------------------
Test #: 3
k,n: (390, 5317)
Method1: 5.28722500801s
Method2: 4.86999702454s
--------------------
Test #: 4
k,n: (222, 5009)
Method1: 1.73456382751s
Method2: 0.929286956787s
--------------------
Test #: 5
k,n: (915, 16561)
Method1: 101.782826185s
Method2: 159.167108059s
--------------------
Test #: 6
k,n: (130, 11283)
Method1: 1.53138184547s
Method2: 0.64450097084s
--------------------
Test #: 7
k,n: (57, 37863)
Method1: 1.44776391983s
Method2: 0.494270086288s
--------------------
Test #: 8
k,n: (110, 34599)
Method1: 3.51851701736s
Method2: 1.61688089371s

Two much faster versions would be:

(A.T*s).dot(A)
(A.T).dot(A*s[:,None])
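
Both expressions follow from writing the sum as a single matrix product: summing s[i] times the outer product of A[i] with itself over all i is the same as A.T times diag(s) times A, and scaling the columns of A.T (or the rows of A) by s avoids building the (n,n) diagonal matrix. A minimal sketch that checks this on small data; the sizes, the seed, and the names G_loop and G_diag are arbitrary choices for illustration, not from the question:

```python
import numpy as np

rng = np.random.RandomState(0)   # arbitrary seed, for reproducibility only
n, k = 50, 4                     # small illustrative sizes
A = rng.randint(0, 1000, (n, k)).astype(float)
s = rng.randint(0, 1000, n) * 1.0

# Reference: accumulate the scaled outer products one row at a time
G_loop = np.zeros((k, k))
for i in range(n):
    G_loop += s[i] * np.outer(A[i], A[i])

# Explicit diag(s) form; only practical for small n because diag(s) is (n,n)
G_diag = A.T.dot(np.diag(s)).dot(A)

# Broadcasting forms: scale the columns of A.T, or the rows of A, by s
G3 = (A.T * s).dot(A)
G4 = A.T.dot(A * s[:, None])

print(np.allclose(G_loop, G_diag), np.allclose(G_loop, G3), np.allclose(G_loop, G4))
```

Either broadcasting form reduces the whole computation to one 2D matrix product of shapes (k,n) and (n,k).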

Issue(s) with method2:

With method2, we are creating A[:,np.newaxis].T, which has shape (k,1,n), a 3D array. I think with a 3D array, np.dot goes into some kind of loop and isn't truly vectorized (the source code could reveal more here).
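
To make the shapes concrete, here is a small sketch that traces them through method2. The tiny sizes and the names left and right are only for illustration:

```python
import numpy as np

n, k = 6, 3                          # tiny illustrative sizes
A = np.arange(n * k).reshape(n, k)
s = np.arange(n) * 1.0

left = A[:, np.newaxis].T            # (n,1,k) with its axes reversed -> (k,1,n)
right = s[:, np.newaxis] * A         # (n,1) broadcast against (n,k) -> (n,k)

# For an N-D left operand, np.dot sums over its last axis and the
# second-to-last axis of the right operand, giving shape (k,1,k) here
G2 = np.dot(left, right)

print(left.shape, right.shape, G2.shape)
```

The singleton middle axis carries no information, which is why the result has to be squeezed afterwards.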

For such 3D tensor multiplications, it's better to use the tensor equivalent: np.tensordot. Thus, an improved version of method2 becomes:

G2 = np.tensordot(A[:,np.newaxis].T, s[:,np.newaxis]*A, axes=((2),(0)))
G2 = np.squeeze(G2)

Since we are sum-reducing just one axis from each of those inputs with np.tensordot, we don't really need tensordot here; a plain np.dot on the squeezed (2D) versions would suffice. This leads us back to method4.
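
As a quick check of that reduction, the sketch below compares the tensordot call on the 3D input with a plain 2D np.dot. The sizes, the seed, and the names sA, G_td and G_dot are assumptions made for the example:

```python
import numpy as np

rng = np.random.RandomState(0)       # arbitrary seed
n, k = 40, 5                         # small illustrative sizes
A = rng.randint(0, 1000, (n, k))
s = rng.randint(0, 1000, n) * 1.0
sA = s[:, np.newaxis] * A            # rows of A scaled by s, shape (n,k)

# 3D input: contract axis 2 of the left operand with axis 0 of the right
G_td = np.squeeze(np.tensordot(A[:, np.newaxis].T, sA, axes=((2), (0))))

# Same contraction on the 2D arrays: no singleton axis, no squeeze needed
G_dot = A.T.dot(sA)

print(G_td.shape, G_dot.shape, np.allclose(G_td, G_dot))
```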

Runtime test

Approaches -

def method1(A, s):
    n, k = A.shape # take the sizes from the input rather than from globals
    G1 = np.zeros((k,k)) # initialize G1 as a (k,k) matrix
    for i in range(n):
        G1 += s[i] * np.dot(np.array([A[i]]).T, np.array([A[i]]))
    return G1

def method2(A, s):
    G2 = np.dot(A[:,np.newaxis].T, s[:,np.newaxis]*A)
    G2 = np.squeeze(G2) # reduces dimension from (k,1,k) to (k,k)
    return G2

def method3(A, s):
    return (A.T*s).dot(A)

def method4(A, s):
    return (A.T).dot(A*s[:,None])

def method2_improved(A, s):
    G2 = np.tensordot(A[:,np.newaxis].T, s[:,np.newaxis]*A, axes=((2),(0)))
    G2 = np.squeeze(G2)
    return G2

Timings and verification -

In [56]: k = 200
    ...: n = 5000
    ...: A = np.random.randint(0, 1000, (n,k))
    ...: s = np.random.randint(0, 1000, n) * 1.0
    ...: 

In [72]: print np.allclose(method1(A, s), method2(A, s))
    ...: print np.allclose(method1(A, s), method3(A, s))
    ...: print np.allclose(method1(A, s), method4(A, s))
    ...: print np.allclose(method1(A, s), method2_improved(A, s))
    ...: 
True
True
True
True

In [73]: %timeit method1(A, s)
    ...: %timeit method2(A, s)
    ...: %timeit method3(A, s)
    ...: %timeit method4(A, s)
    ...: %timeit method2_improved(A, s)
    ...: 
1 loops, best of 3: 1.12 s per loop
1 loops, best of 3: 693 ms per loop
100 loops, best of 3: 8.12 ms per loop
100 loops, best of 3: 8.17 ms per loop
100 loops, best of 3: 8.28 ms per loop
