How do I vectorize this code that produces a scipy sparse matrix?

Let me start off by explaining what I want to do. I am trying to build a recommendation system based on m packages, each with n features, stored in an m x n sparse matrix X. To do this, I'm attempting to run kNN to get the k closest matches for each package. I want to build an m x m sparse matrix K where K[i, j] is the dot product of rows X[i] and X[j] if X[j] was a package returned by kNN for X[i], otherwise 0.

Here is the code I've written:

from scipy.sparse import lil_matrix
from sklearn.neighbors import NearestNeighbors

X = ...
knn = NearestNeighbors(n_neighbors=self.n_neighbors, metric='l2')
knn.fit(X)
knn_indices = knn.kneighbors(X, return_distance=False)

m, k = X.shape[0], self.n_neighbors
K = lil_matrix((m, m))

for i, indices in enumerate(knn_indices):
    xi = X.getrow(i)
    for j in indices:
        xj = X.getrow(j)
        K[i, j] = xi.dot(xj.T)[0, 0]

I'm trying to figure out how to make this more efficient. In my scenario, m is ~1.2 million, n is ~50000, and k is 500, so performance is very important.

The last part, where I populate K, is the bottleneck of my program. getrow seems to perform very poorly; according to the scipy docs, it makes a copy of the row, so each getrow call could be copying up to 50k elements. Also, in the innermost loop I can't figure out how to get back a scalar from dot instead of creating a whole new 1x1 sparse matrix.

How can I avoid these problems and speed up/vectorize the last part of this code? Thanks.

In [21]: from scipy import sparse
In [22]: M = sparse.random(10,10,.2,'csr')
In [23]: M
Out[23]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>

Looking at M.A, I selected this small knn_indices array for testing:

In [45]: knn = np.array([[4],[2],[],[1,3]])

Your double loop:

In [46]: for i, indices in enumerate(knn):
    ...:     xi = M[i,:]
    ...:     for j in indices:
    ...:         xj = M[j,:]
    ...:         print((xi*xj.T).A)
    ...:         
[[0.35494592]]
[[0.]]
[[0.08112133]]
[[0.56905781]]

The inner loop can be condensed by indexing all of a row's neighbours at once:

In [47]: for i, indices in enumerate(knn):
    ...:     xi = M[i,:]
    ...:     xj = M[indices,:]
    ...:     print((xi*xj.T).A)
    ...:         
[[0.35494592]]
[[0.]]
[]
[[0.08112133 0.56905781]]

and with the assignment:

In [49]: k = sparse.lil_matrix((4,5))
In [50]: for i, indices in enumerate(knn):
    ...:     xi = M[i,:]
    ...:     for j in indices:
    ...:         xj = M[j,:]
    ...:         k[i,j] = (xi*xj.T)[0,0]
    ...:         
    ...:         
In [51]: k.A
Out[51]: 
array([[0.        , 0.        , 0.        , 0.        , 0.35494592],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.08112133, 0.        , 0.56905781, 0.        ]])

The second loop, with

 k[i,indices] = (xi*xj.T)

does the same thing.

It may be possible to do something with the i loop as well, but this is at least a start.
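One way the i loop might be removed as well is to expand the neighbour lists into flat (row, col) pairs and compute all the dot products with an elementwise multiply followed by a row sum. The following is only a sketch under some assumptions: knn_indices is taken to be the rectangular (m, k) integer array that kneighbors returns, and the names knn_dot_matrix and chunk are hypothetical, not part of scipy or scikit-learn. Work is done in chunks so that the intermediate gathered rows stay small; it has not been timed at the question's scale, and the flat index and value arrays still need m*k entries each.

```python
import numpy as np
from scipy import sparse


def knn_dot_matrix(X, knn_indices, chunk=10000):
    """Return K with K[i, j] = X[i] . X[j] for every j in knn_indices[i]."""
    X = sparse.csr_matrix(X)              # CSR gives cheap row gathering
    knn_indices = np.asarray(knn_indices)
    m, k = knn_indices.shape

    # Flatten the neighbour table into COO-style coordinate arrays.
    rows = np.repeat(np.arange(m), k)
    cols = knn_indices.ravel()
    vals = np.empty(rows.shape[0], dtype=np.float64)

    for start in range(0, rows.shape[0], chunk):
        stop = start + chunk
        # Gather both sides of each pair, multiply elementwise, sum per row.
        prod = X[rows[start:stop], :].multiply(X[cols[start:stop], :])
        vals[start:stop] = np.asarray(prod.sum(axis=1)).ravel()

    K = sparse.coo_matrix((vals, (rows, cols)), shape=(m, X.shape[0])).tocsr()
    K.eliminate_zeros()                   # drop pairs whose dot product is 0
    return K
```

The chunk size trades memory for fewer Python-level iterations; each iteration materialises two sparse blocks of chunk rows.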

That knn doesn't need to be an array. With differing inner list lengths it's an object dtype array anyway (and recent NumPy versions only build such a ragged array if dtype=object is passed explicitly). Better to leave it as a list.

An alternative to filling this lil matrix would be to accumulate i, indices and the dot products in coo-style arrays.

In [64]: r,c,d = [],[],[]
In [65]: for i, indices in enumerate(knn):
    ...:     xi = M[i,:]
    ...:     xj = M[indices,:]
    ...:     t = (xi*xj.T).data
    ...:     if len(t)>0:
    ...:         r.extend([i]*len(indices))
    ...:         c.extend(indices)
    ...:         d.extend(t)
    ...:         
In [66]: r,c,d
Out[66]: 
([0, 3, 3],
 [4, 1, 3],
 [0.3549459176547072, 0.08112132851228658, 0.5690578146292733])
In [67]: sparse.coo_matrix((d,(r,c))).A
Out[67]: 
array([[0.        , 0.        , 0.        , 0.        , 0.35494592],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.08112133, 0.        , 0.56905781, 0.        ]])

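In my test case the 2nd row doesn't have any nonzero values, requiring an extra test in the loop. Note that .data holds only the nonzero products, so if a row has a mix of zero and nonzero products, len(t) no longer matches len(indices) and r, c and d fall out of step. I don't know if this is any faster than the lil approach.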
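Since the .data attribute of the product only contains the nonzero dot products, a variant that converts each row's products to a small dense vector keeps the three lists the same length even when some products are zero. This is a minimal sketch, not a benchmarked solution; the function name knn_dots_coo is hypothetical, and passing shape explicitly is my addition so the result does not depend on the largest index that happens to appear.

```python
import numpy as np
from scipy import sparse


def knn_dots_coo(M, knn):
    """Accumulate K[i, j] = M[i] . M[j] for j in knn[i] via COO-style lists."""
    M = sparse.csr_matrix(M)
    r, c, d = [], [], []
    for i, indices in enumerate(knn):
        if len(indices) == 0:
            continue                      # no neighbours for this row
        # 1 x len(indices) product, densified so zeros keep their position.
        t = (M[i, :] @ M[indices, :].T).toarray().ravel()
        r.extend([i] * len(indices))
        c.extend(indices)
        d.extend(t)
    K = sparse.coo_matrix((d, (r, c)), shape=(len(knn), M.shape[0])).tocsr()
    K.eliminate_zeros()                   # remove explicitly stored zeros
    return K
```

The dense vector has only len(indices) entries per row (k in the question's notation), so densifying it is cheap compared with the sparse product itself.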
