How do I vectorize this code that produces a scipy sparse matrix?
Let me start off by explaining what I want to do. I am trying to build a recommendation system based on `m` packages, each with `n` features, stored in an `mxn` sparse matrix `X`. To do this, I'm attempting to run kNN to get the `k` closest matches for each package. I want to build an `mxm` sparse matrix `K` where `K[i, j]` is the dot product of rows `X[i]` and `X[j]` if `X[j]` was a package returned by kNN for `X[i]`, and 0 otherwise.
Here is the code I've written:
X = ...
knn = NearestNeighbors(n_neighbors=self.n_neighbors, metric='l2')
knn.fit(X)
knn_indices = knn.kneighbors(X, return_distance=False)
m, k = X.shape[0], self.n_neighbors
K = lil_matrix((m, m))
for i, indices in enumerate(knn_indices):
    xi = X.getrow(i)
    for j in indices:
        xj = X.getrow(j)
        K[i, j] = xi.dot(xj.T)[0, 0]
I'm trying to figure out how to make this more efficient. In my scenario, `m` is ~1.2 million, `n` is ~50000, and `k` is 500, so performance is very important.
The last part, where I populate `K`, is the bottleneck of my program. `getrow` seems to perform very poorly; according to the scipy docs, it makes a copy of the row, so each `getrow` call could be copying up to 50k elements. Also, in the innermost loop I can't figure out how to get back a scalar from `dot` instead of creating a whole new `1x1` sparse matrix.
How can I avoid these problems and speed up/vectorize the last part of this code? Thanks.
In [21]: from scipy import sparse
In [22]: M = sparse.random(10,10,.2,'csr')
In [23]: M
Out[23]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
Looking at `M.A`, I selected this small `knn_indices` array for testing:
In [45]: knn = np.array([[4],[2],[],[1,3]])
Your double loop:
In [46]: for i, indices in enumerate(knn):
    ...:     xi = M[i,:]
    ...:     for j in indices:
    ...:         xj = M[j,:]
    ...:         print((xi*xj.T).A)
    ...:
[[0.35494592]]
[[0.]]
[[0.08112133]]
[[0.56905781]]
The inner loop can be condensed:
In [47]: for i, indices in enumerate(knn):
    ...:     xi = M[i,:]
    ...:     xj = M[indices,:]
    ...:     print((xi*xj.T).A)
    ...:
[[0.35494592]]
[[0.]]
[]
[[0.08112133 0.56905781]]
and with the assignment:
In [49]: k = sparse.lil_matrix((4,5))
In [50]: for i, indices in enumerate(knn):
    ...:     xi = M[i,:]
    ...:     for j in indices:
    ...:         xj = M[j,:]
    ...:         k[i,j] = (xi*xj.T)[0,0]
    ...:
In [51]: k.A
Out[51]:
array([[0. , 0. , 0. , 0. , 0.35494592],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0.08112133, 0. , 0.56905781, 0. ]])
The second loop, with `k[i,indices] = (xi*xj.T)`, does the same thing.
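As a self-contained sketch of that equivalence, the two variants can be wrapped in functions and compared. The names `fill_elementwise` and `fill_rowwise`, the random seed, and the output shape are illustrative assumptions, not part of the original session. The row-wise version converts the small product block to a dense 1-D array before assigning, and skips empty neighbour lists explicitly.

```python
import numpy as np
from scipy import sparse

# Small random test matrix; the seed is arbitrary and only makes the run repeatable.
M = sparse.random(10, 10, density=0.2, format="csr", random_state=0)
knn = [[4], [2], [], [1, 3]]          # neighbour lists, kept as a plain list
shape = (len(knn), M.shape[0])

def fill_elementwise(M, knn, shape):
    """Double loop: one scalar assignment per (i, j) pair."""
    K = sparse.lil_matrix(shape)
    for i, indices in enumerate(knn):
        xi = M[i, :]
        for j in indices:
            K[i, j] = (xi @ M[j, :].T)[0, 0]
    return K

def fill_rowwise(M, knn, shape):
    """Single loop: one assignment per row i."""
    K = sparse.lil_matrix(shape)
    for i, indices in enumerate(knn):
        if len(indices) == 0:          # nothing to assign for this row
            continue
        # 1 x len(indices) block of dot products, flattened to 1-D
        vals = (M[i, :] @ M[indices, :].T).toarray().ravel()
        K[i, indices] = vals
    return K

K1 = fill_elementwise(M, knn, shape)
K2 = fill_rowwise(M, knn, shape)
```

Both functions should produce the same matrix; the row-wise one simply does fewer Python-level indexing operations.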
It may be possible to do something with the `i` loop as well, but this is at least a start.
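One possible way to remove the `i` loop, sketched here under assumptions that go beyond the original answer: if `knn_indices` is a rectangular `(m, k)` integer array (as `kneighbors` returns in the question), every `(i, j)` pair can be expanded into two index vectors, and the row-wise dot products computed as an elementwise product summed over features. The helper name `knn_dot_matrix` and the `chunk` parameter are hypothetical; chunking is only there to bound the size of the temporary matrices, and a suitable value would have to be found by experiment.

```python
import numpy as np
from scipy import sparse

def knn_dot_matrix(X, knn_indices, chunk=1000):
    """K[i, j] = X[i] . X[j] for j in knn_indices[i], else 0 (illustrative helper)."""
    X = sparse.csr_matrix(X)
    m, k = knn_indices.shape
    data = np.empty(m * k, dtype=X.dtype)
    for start in range(0, m, chunk):
        stop = min(start + chunk, m)
        rows = np.repeat(np.arange(start, stop), k)   # each i repeated k times
        cols = knn_indices[start:stop].ravel()        # the matching j values
        # Pair up rows X[i] and X[j], multiply elementwise, sum over features
        prod = X[rows].multiply(X[cols]).sum(axis=1)
        data[start * k:stop * k] = np.asarray(prod).ravel()
    all_rows = np.repeat(np.arange(m), k)
    K = sparse.csr_matrix((data, (all_rows, knn_indices.ravel())), shape=(m, m))
    K.eliminate_zeros()                               # drop explicitly stored zeros
    return K

# Tiny usage example with made-up data
X = sparse.random(20, 30, density=0.2, format="csr", random_state=1)
rng = np.random.default_rng(0)
knn_indices = np.array([rng.choice(20, size=3, replace=False) for _ in range(20)])
K = knn_dot_matrix(X, knn_indices, chunk=7)
```

Whether this beats the loop at the question's scale is untested here; each chunk materialises `chunk * k` copied rows of `X`, so memory use grows quickly with `chunk`.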
That `knn` doesn't need to be an array. With differing inner list lengths it's an object dtype anyway. Better to leave it as a list.
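A short illustration of that point, as a sketch: the explicit `dtype=object` is an addition here, since recent NumPy releases refuse to build an array from ragged nested lists without it. The loop itself works on the plain list directly.

```python
import numpy as np

knn = [[4], [2], [], [1, 3]]          # ragged: inner lists differ in length

# With ragged input the result is a 1-D array whose elements are Python lists.
arr = np.array(knn, dtype=object)
print(arr.dtype, arr.shape)           # object (4,)

# Iterating over the plain list gives the same (i, j) pairs with no array needed.
pairs = [(i, j) for i, indices in enumerate(knn) for j in indices]
```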
An alternative to filling this `lil` matrix would be to accumulate `i`, `indices` and the dot product in `coo` style arrays.
In [64]: r,c,d = [],[],[]
In [65]: for i, indices in enumerate(knn):
    ...:     xi = M[i,:]
    ...:     xj = M[indices,:]
    ...:     t = (xi*xj.T).data
    ...:     if len(t)>0:
    ...:         r.extend([i]*len(indices))
    ...:         c.extend(indices)
    ...:         d.extend(t)
    ...:
In [66]: r,c,d
Out[66]:
([0, 3, 3],
[4, 1, 3],
[0.3549459176547072, 0.08112132851228658, 0.5690578146292733])
In [67]: sparse.coo_matrix((d,(r,c))).A
Out[67]:
array([[0. , 0. , 0. , 0. , 0.35494592],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0.08112133, 0. , 0.56905781, 0. ]])
In my test case the 2nd row doesn't have any nonzero values, requiring an extra test in the loop. I don't know if this is any faster than the `lil` approach.
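One caveat with taking `.data` from the product: it holds only the stored nonzero values, so if some (but not all) of a row's dot products are zero, `d` would end up shorter than `r` and `c`. A hedged variant, with the hypothetical name `knn_dots_coo`, densifies the small `1 x len(indices)` block so the three lists always stay aligned, then drops the zeros after construction:

```python
import numpy as np
from scipy import sparse

def knn_dots_coo(M, knn, shape):
    """Accumulate (row, col, value) triplets, then build the matrix once."""
    r, c, d = [], [], []
    for i, indices in enumerate(knn):
        if len(indices) == 0:
            continue
        # Dense 1-D array with exactly one value per neighbour, zeros included
        vals = (M[i, :] @ M[indices, :].T).toarray().ravel()
        r.extend([i] * len(indices))
        c.extend(indices)
        d.extend(vals)
    K = sparse.coo_matrix((d, (r, c)), shape=shape).tocsr()
    K.eliminate_zeros()                # remove the explicitly stored zeros
    return K

# Usage on a small random matrix; the seed and shape are arbitrary choices.
M = sparse.random(10, 10, density=0.2, format="csr", random_state=0)
knn = [[4], [2], [], [1, 3]]
K = knn_dots_coo(M, knn, (len(knn), M.shape[0]))
```

The extra cost is one small dense array per row, of length `k`.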