Efficient accessing in sparse matrices

I'm working with recommender systems, but I'm struggling with the access times of scipy sparse matrices.

In this case I'm implementing TrustSVD, so I need a structure that is efficient to operate on by both rows and columns (CSR, CSC). I've considered keeping both structures, dictionaries, and so on, but either way it is always too slow, especially compared with NumPy matrix operations.

for u, j in zip(*ratings.nonzero()):
    items_rated_by_u = ratings[u, :].nonzero()[1]
    users_who_rated_j = ratings[:, j].nonzero()[0]
    # More code...

Extra: Each pass through the loop body takes around 0.033 s, so one sweep over 35,000 ratings takes about 19 min per SGD iteration, and a minimum of 25 iterations comes to about 8 h. And that is only the access time; if I include the factorization part, it would take around 2 days.

When you index a sparse matrix, especially when you just ask for a row or a column, it not only has to select the values, it also has to construct a new sparse matrix. np.ndarray construction is done in compiled code, but most of the sparse construction is pure Python. The nonzero()[1] construct requires converting the matrix to coo format and picking the row and col attributes (look at its code).
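To make the two costs described above concrete, here is a small sketch. The 3x3 matrix and the variable names are only illustrative; the mask step mirrors what nonzero() does, namely keeping the coo entries whose stored value is not zero.

```python
import numpy as np
from scipy import sparse

# Illustrative 3x3 matrix, not the ratings data from the question.
M = sparse.csr_matrix(np.array([[0, 1, 0], [1, 0, 0], [0, 1, 1]]))

# Cost 1: row indexing builds a brand new 1x3 sparse matrix.
row = M[2, :]

# Cost 2: nonzero() converts to coo format and then reads row/col.
coo = row.tocoo()
mask = coo.data != 0          # drop explicitly stored zeros
cols_by_hand = coo.col[mask]  # column indices of the non-zero entries

cols_via_nonzero = row.nonzero()[1]
```

Both steps run once per rating in the question's loop, and twice if the column lookup is counted, which is where the time goes.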

I think you could get the column indices of each row faster by looking at the rows attribute of the lil format, or of its transpose:

In [418]: sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
Out[418]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in LInked List format>
In [419]: M=sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
In [420]: M.A
Out[420]: 
array([[0, 1, 0],
       [1, 0, 0],
       [0, 1, 1]], dtype=int32)
In [421]: M.rows
Out[421]: array([[1], [0], [1, 2]], dtype=object)
In [422]: M[1,:].nonzero()[1]
Out[422]: array([0], dtype=int32)
In [423]: M[2,:].nonzero()[1]
Out[423]: array([1, 2], dtype=int32)
In [424]: M.T.rows
Out[424]: array([[1], [0, 2], [2]], dtype=object)
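Applied to the loop in the question, the rows lookup could look roughly like the sketch below. The small ratings matrix stands in for the real data, and the names items_by_user, users_by_item and pairs are made up for this example. The point is to build both orientations once, outside the SGD loop, so the loop body does plain list lookups.

```python
import numpy as np
from scipy import sparse

# Stand-in for the real ratings matrix (users x items).
ratings = sparse.lil_matrix(np.array([[0, 1, 0], [1, 0, 0], [0, 1, 1]]))

# Build both orientations once, before iterating.
items_by_user = ratings.rows            # rows[u] -> column indices in row u
users_by_item = ratings.T.tolil().rows  # rows[j] -> row indices in column j

pairs = []
for u, j in zip(*ratings.nonzero()):
    items_rated_by_u = items_by_user[u]    # plain Python list, no new matrix
    users_who_rated_j = users_by_item[j]
    pairs.append((u, j, items_rated_by_u, users_who_rated_j))
    # More code...
```

The lists in rows are the ones stored by the matrix itself, so treat them as read-only unless you intend to change the matrix.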

You could also access these values in the csr format, but it's a bit more complicated:

In [425]: M.tocsr().indices
Out[425]: array([1, 0, 1, 2], dtype=int32)
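The complication is that indices is one flat array covering all rows, so it has to be sliced with indptr to get a single row. A minimal sketch, where the helper names cols_of_row and rows_of_col are hypothetical and not part of scipy, and a csc copy is kept for the column direction:

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.array([[0, 1, 0], [1, 0, 0], [0, 1, 1]]))
M_csc = M.tocsc()  # second copy, laid out by column

def cols_of_row(csr, u):
    """Column indices stored in row u of a csr matrix."""
    return csr.indices[csr.indptr[u]:csr.indptr[u + 1]]

def rows_of_col(csc, j):
    """Row indices stored in column j of a csc matrix."""
    return csc.indices[csc.indptr[j]:csc.indptr[j + 1]]

items_rated_by_2 = cols_of_row(M, 2)
users_who_rated_1 = rows_of_col(M_csc, 1)
```

These slices do not build a new sparse matrix. Unlike nonzero(), they also return entries that are stored explicitly as zero; calling eliminate_zeros() on the matrix first removes those if that matters.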
