
Python - Efficient Function with scipy sparse Matrices

For a project, I need an efficient Python function that solves the following task:

Given a very large list X of long sparse vectors (i.e. a big sparse matrix) and another matrix Y that contains a single vector y, I want a list of the "distances" from y to every element of X. The "distance" is defined like this:

Compare the two vectors element by element, always take the smaller of the two values, and sum the results.

Example:

X = [[0,0,2],   
     [1,0,0],
     [3,1,0]]

Y = [[1,0,2]]

The function should return dist = [2,1,1]

In my project, both X and Y contain a lot of zeros and arrive as instances of:

<class 'scipy.sparse.csr.csr_matrix'>

So far so good. I managed to write a function that solves this task, but it is very slow and horribly inefficient. I need some tips on how to process and iterate over the sparse matrices efficiently. This is my function:

def get_distances(X, Y):
    ret = []
    rows, cols = X.shape

    for i in range(0, rows):
        dist = 0
        sample = X.getrow(i).todense()
        test = Y.getrow(0).todense()
        rows_s, cols_s = sample.shape
        rows_t, cols_t = test.shape

        for s, t in zip(range(0, cols_s), range(0, cols_t)):
            dist += min(sample[0, s], test[0, t])

        ret.append(dist)

    return ret

To do these operations I convert the sparse matrices to dense matrices, which is of course horrible, but I did not know how to do it better. Do you know how to improve my code and make the function faster?

Thanks a lot!

I revised your function and ran it:

import numpy as np
from scipy import sparse

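def get_distances(X, Y):
    ret = []
    test = Y.getrow(0).A    # dense 1 x n array; .A is shorthand for .toarray()
    for row in X:           # iterating over a csr_matrix yields 1 x n sparse rows
        sample = row.A
        dist = np.minimum(sample[0, :], test[0, :]).sum()
        ret.append(dist)
    return ret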

X = [[0,0,2],   
     [1,0,0],
     [3,1,0]]

Y = [[1,0,2]]

XM = sparse.csr_matrix(X)
YM = sparse.csr_matrix(Y)

print(get_distances(XM, YM))

print(np.minimum(XM.A, YM.A).sum(axis=1))

producing

1255:~/mypy$ python3 stack37056258.py 
[2, 1, 1]
[2 1 1]

np.minimum takes the element-wise minimum of two arrays (which may be 2D), so I don't need to iterate over the columns. I also don't need to use indexing.

minimum is also implemented for sparse matrices, but I get a segmentation fault when I try to apply it to your X (with 3 rows) and Y (with 1 row). If they are the same size, this works:

Ys = sparse.vstack((YM,YM,YM))
print(Ys.shape)
print (XM.minimum(Ys).sum(axis=1))

Converting the single-row matrix to an array also gets around the error, because it ends up using the dense version, np.minimum(XM.todense(), YM.A):

print (XM.minimum(YM.A).sum(axis=1))

When I try other element-by-element operations on these two matrices, e.g. XM+YM or XM<YM, I get ValueError: inconsistent shapes. It looks like sparse does not implement broadcasting the way numpy arrays do.
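Since broadcasting is not available, one way to avoid both densifying X and replicating Y is to work on the CSR arrays directly. The following is only a sketch, and it assumes that all values are non-negative (so entries where X stores nothing contribute 0 to the sum) and that X has no duplicate stored entries. The function name min_sum_distances is made up for this illustration:

```python
import numpy as np
from scipy import sparse

def min_sum_distances(X, Y):
    """Row-wise sum of element-wise minima of sparse X (m x n) and sparse Y (1 x n).

    Assumption: all values are non-negative and X has no duplicate entries.
    """
    X = X.tocsr()
    n_rows = X.shape[0]
    y = Y.toarray().ravel()                  # only the single vector y is densified
    # for every stored entry of X, compare it with y at the same column
    mins = np.minimum(X.data, y[X.indices])
    # row index of every stored entry, recovered from the CSR row pointer
    row_ids = np.repeat(np.arange(n_rows), np.diff(X.indptr))
    # sum the minima per row (bincount returns floats when weights are given)
    return np.bincount(row_ids, weights=mins, minlength=n_rows)

XM = sparse.csr_matrix([[0, 0, 2], [1, 0, 0], [3, 1, 0]])
YM = sparse.csr_matrix([[1, 0, 2]])
dist = min_sum_distances(XM, YM)
print(dist)
```

The work here is proportional to the number of stored entries in X rather than to rows times columns.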

=======================

Comparison of ways of replicating a 1-row sparse matrix many times:

In [271]: A=sparse.csr_matrix([0,1,0,0,1])

In [272]: timeit sparse.vstack([A]*3000).A
10 loops, best of 3: 32.3 ms per loop

In [273]: timeit sparse.kron(A,np.ones((3000,1),int)).A
1000 loops, best of 3: 1.27 ms per loop

When replicating many times, kron is considerably faster than vstack (1.27 ms versus 32.3 ms in the timings above).
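Applied to the question's data, the kron replication could be combined with the sparse minimum roughly as follows. This is a sketch that reuses the XM and YM names from earlier; passing format="csr" is my own choice so that both operands are CSR matrices:

```python
import numpy as np
from scipy import sparse

XM = sparse.csr_matrix([[0, 0, 2], [1, 0, 0], [3, 1, 0]])
YM = sparse.csr_matrix([[1, 0, 2]])

# a column of ones with one entry per row of XM; kron then stacks YM that many times
ones_col = np.ones((XM.shape[0], 1), dtype=int)
Ys = sparse.kron(YM, ones_col, format="csr")

# both operands now have the same shape, so the sparse minimum applies
dist = np.asarray(XM.minimum(Ys).sum(axis=1)).ravel()
print(Ys.shape, dist)
```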

=======================

There's some overlap with the issues in "Scipy sparse matrix alternative for getrow()".

Try the code below for sparse matrices:

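from scipy.sparse import csr_matrix, vstack
X = csr_matrix([[0,0,2],[1,0,0],[3,1,0]])
Y = csr_matrix([[1,0,2]])
def matrix_dist(x,y):
    # replicate y once per row of x so that both operands have the same shape
    y=vstack([y]*x.shape[0])
    # min(a,b) = ((a+b) - |a-b|) / 2, with |a-b| computed as (a-b) * sign(a-b)
    return (((x+y)-(x-y).multiply((x-y).sign())).sum(1)/2).A.ravel()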
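The expression relies on the identity min(a, b) = ((a + b) - |a - b|) / 2. As a quick check, here is a sketch that applies it to a non-square matrix, where y has to be replicated once per row of x, and compares the result against the dense np.minimum. The name min_sum_via_abs and the test data are made up for this illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

def min_sum_via_abs(x, y):
    # replicate the single row y once per row of x
    y_rep = vstack([y] * x.shape[0]).tocsr()
    # min(a, b) = ((a + b) - |a - b|) / 2, summed along each row
    total = ((x + y_rep) - abs(x - y_rep)).sum(axis=1)
    return np.asarray(total).ravel() / 2

X = csr_matrix([[0, 0, 2, 5], [1, 0, 0, 0]])   # 2 rows, 4 columns
Y = csr_matrix([[1, 0, 2, 3]])

result = min_sum_via_abs(X, Y)
expected = np.minimum(X.toarray(), Y.toarray()).sum(axis=1)
print(result, expected)
```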
