
Converting python sparse matrix dict to scipy sparse matrix

I am using Python scikit-learn for document clustering, and I have a sparse matrix stored in a dict object:

For example:

doc_term_dict = { ('d1','t1'): 12,
                  ('d2','t3'): 10,
                  ('d3','t2'):  5
                  }                            # from mysql data table
<type 'dict'>

I want to use scikit-learn to do the clustering, where the input matrix type is scipy.sparse.csr.csr_matrix

Example:

(0, 2164)   0.245793088885
(0, 2076)   0.205702177467
(0, 2037)   0.193810934784
(0, 2005)   0.14547028437
(0, 1953)   0.153720023365
...
<class 'scipy.sparse.csr.csr_matrix'>

I can't find a way to convert the dict to this CSR matrix (I have never used SciPy).

Pretty straightforward. First read the dictionary and convert the keys to the appropriate row and column indices. SciPy supports (and recommends for this purpose) the COOrdinate (COO) format for sparse matrices.

Pass it data, row, and column, where A[row[k], column[k]] = data[k] (for all k) defines the matrix. Then let SciPy do the conversion to CSR.
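As a minimal sketch of that definition, here are the three entries from the question written out as COO triples. The indices are assumed to be already converted to 0-based integers, and the explicit 3x3 shape is an assumption for illustration:

```python
from scipy.sparse import coo_matrix

# Parallel lists: entry k of the matrix is A[row[k], col[k]] = data[k].
data = [12, 10, 5]
row = [0, 1, 2]
col = [0, 2, 1]

# Build the COO matrix, then let SciPy convert it to CSR.
coo = coo_matrix((data, (row, col)), shape=(3, 3))
csr = coo.tocsr()

# A dense copy is only sensible for tiny matrices like this one.
dense = csr.toarray()
```

Passing shape explicitly is optional; without it SciPy infers the shape from the largest row and column index.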

Please check that I have the rows and columns the way you want them; I might have them transposed. I also assumed that the input is 1-indexed.

My code below prints:

(0, 0)        12
(1, 2)        10
(2, 1)        5

Code:

#!/usr/bin/env python3
#http://stackoverflow.com/questions/26335059/converting-python-sparse-matrix-dict-to-scipy-sparse-matrix

from scipy.sparse import csr_matrix, coo_matrix

def convert(term_dict):
    ''' Convert a dictionary with elements of form ('d1', 't1'): 12 to a CSR type matrix.
    The element ('d1', 't1'): 12 becomes entry (0, 0) = 12.
    * Conversion from 1-indexed to 0-indexed.
    * d is row
    * t is column.
    '''
    # Create the appropriate format for the COO format.
    data = []
    row = []
    col = []
    for k, v in term_dict.items():
        r = int(k[0][1:])
        c = int(k[1][1:])
        data.append(v)
        row.append(r-1)
        col.append(c-1)
    # Create the COO-matrix
    coo = coo_matrix((data,(row,col)))
    # Let Scipy convert COO to CSR format and return
    return csr_matrix(coo)

if __name__=='__main__':
    doc_term_dict = { ('d1','t1'): 12,
                      ('d2','t3'): 10,
                      ('d3','t2'):  5
                    }
    print(convert(doc_term_dict))
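
The parsing above relies on labels of the form 'd<number>' and 't<number>'. If the labels coming from the database are arbitrary strings, one possible variation is to give each distinct label its own index and keep the label lists, so that rows and columns can be mapped back later. This is only a sketch and not part of the original answer; the function name convert_with_labels and the sample labels are made up for illustration.

```python
from scipy.sparse import coo_matrix

def convert_with_labels(term_dict):
    """Map arbitrary (doc, term) labels to indices and build a CSR matrix.

    Returns the matrix plus the sorted label lists, so that row i is
    doc_labels[i] and column j is term_labels[j].
    """
    doc_labels = sorted({doc for doc, _ in term_dict})
    term_labels = sorted({term for _, term in term_dict})
    doc_index = {doc: i for i, doc in enumerate(doc_labels)}
    term_index = {term: j for j, term in enumerate(term_labels)}

    # Keys and values of a dict iterate in the same order, so the
    # three lists stay aligned entry by entry.
    rows = [doc_index[doc] for doc, _ in term_dict]
    cols = [term_index[term] for _, term in term_dict]
    data = list(term_dict.values())

    shape = (len(doc_labels), len(term_labels))
    matrix = coo_matrix((data, (rows, cols)), shape=shape).tocsr()
    return matrix, doc_labels, term_labels

# Hypothetical sample data with non-numeric labels.
sample = {('doc_a', 'apple'): 12,
          ('doc_b', 'cherry'): 10,
          ('doc_c', 'banana'): 5}
matrix, doc_labels, term_labels = convert_with_labels(sample)
```

Sorting the labels makes the index assignment deterministic, which helps when cluster assignments later have to be matched back to document names.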

We can make @Unapiedra's (excellent) answer a little more sparse:

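import numpy as np
from scipy.sparse import csr_matrix

def _dict_to_csr(term_dict):
    # Python 2 version. Keys are assumed to be (row, col) tuples of
    # 0-based integers, not labels such as ('d1', 't1').
    term_dict_v = list(term_dict.itervalues())
    term_dict_k = list(term_dict.iterkeys())
    # Square shape, large enough to hold the largest index.
    n = np.asarray(term_dict_k).max() + 1
    csr = csr_matrix((term_dict_v, zip(*term_dict_k)), shape=(n, n))
    return csr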

Same as @carsonc, but for Python 3.X:

from scipy.sparse import csr_matrix

def _dict_to_csr(term_dict):
    # Keys are assumed to be (row, col) tuples of 0-based integers.
    term_dict_v = list(term_dict.values())
    rows, cols = (list(idx) for idx in zip(*term_dict.keys()))

    # Size the matrix by the largest index, not by the number of entries.
    shape = (max(rows) + 1, max(cols) + 1)
    csr = csr_matrix((term_dict_v, (rows, cols)), shape=shape)
    return csr
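
Both compact versions expect the keys to already be (row, column) integer pairs rather than labels such as 'd1'. One way to check a conversion like this is to turn the matrix back into a dict and compare it with the input. The sketch below does that under the same integer-key assumption; the names int_dict_to_csr and csr_to_dict are hypothetical helpers, not SciPy functions.

```python
from scipy.sparse import csr_matrix

def int_dict_to_csr(term_dict):
    """Build a CSR matrix from a dict keyed by 0-based (row, col) integers."""
    rows, cols = (list(idx) for idx in zip(*term_dict.keys()))
    shape = (max(rows) + 1, max(cols) + 1)
    return csr_matrix((list(term_dict.values()), (rows, cols)), shape=shape)

def csr_to_dict(matrix):
    """Read the stored entries back out as {(row, col): value}."""
    coo = matrix.tocoo()
    return {(int(r), int(c)): int(v)
            for r, c, v in zip(coo.row, coo.col, coo.data)}

example = {(0, 0): 12, (1, 2): 10, (2, 1): 5}
matrix = int_dict_to_csr(example)
round_trip = csr_to_dict(matrix)
```

The int() casts only suit integer counts like the ones in the question; for float weights they would need to be dropped or replaced with float().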
