Read a dense matrix from a file directly into a sparse numpy array?

Question

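I have a matrix stored in a text file in a tab-separated format. It is stored densely, but I know that it is very sparse. I want to load this matrix into one of Python's sparse formats. The matrix is very large, so calling scipy.loadtxt(...) and then converting the resulting dense array to a sparse format would use too much RAM in the intermediate step, so that is not an option.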

Answer 1

loadtxt works with an open file, or with any iterable that yields lines.

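So one option is to open the file and run loadtxt on blocks of lines, converting each resulting array to a sparse matrix. Collect those sparse matrices in a list, then combine them into one matrix with the block format (sparse.bmat).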

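I haven't used the block format much, but I think it will handle this task correctly. Under the covers, bmat collects the coo attributes (data, rows and cols) of each block and merges them into 3 master coo attributes.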
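To make the merging idea concrete, here is a minimal sketch of combining vertically stacked blocks by hand. It is an illustration of the principle only, not SciPy's actual implementation; the helper name stack_coo_blocks is hypothetical, and it assumes all blocks have the same number of columns.

```python
import numpy as np
from scipy import sparse


def stack_coo_blocks(blocks):
    """Vertically combine sparse blocks by merging their coo attributes.

    Hypothetical helper for illustration; assumes every block has the
    same number of columns.
    """
    data, rows, cols = [], [], []
    row_offset = 0
    n_cols = blocks[0].shape[1]
    for block in blocks:
        block = block.tocoo()
        data.append(block.data)
        # Shift the block's row indices to its position in the full matrix.
        rows.append(block.row + row_offset)
        cols.append(block.col)
        row_offset += block.shape[0]
    return sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=(row_offset, n_cols),
    )


top = sparse.coo_matrix(np.array([[1, 0, 2], [0, 0, 0]]))
bottom = sparse.coo_matrix(np.array([[0, 3, 0]]))
merged = stack_coo_blocks([top, bottom])
```

Only the row indices need an offset here because the blocks are stacked in a single column; a general block layout would offset the column indices as well.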

Under the covers, loadtxt just reads each line and parses it into an array or list. It collects all those rows in a list and, at the end, passes that nested list to np.array().

So you could read each line, parse it into a list or array of values, find the nonzero values, and assemble the corresponding coo arrays.

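Large sparse matrices are often created by assembling the data, i and j 1d arrays, and then calling coo_matrix((data,(i,j)),...). One way or another, that is what you need to do with this CSV data.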
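The direct approach of assembling data, i and j could be sketched as follows. This is a rough sketch under stated assumptions rather than a tuned implementation: the function name dense_text_to_coo is hypothetical, the input is assumed to be an iterable of text lines (for example an open file), and the values are parsed as floats with a tab delimiter as in the question.

```python
import numpy as np
from scipy import sparse


def dense_text_to_coo(lines, delimiter="\t"):
    """Build a coo_matrix from an iterable of delimited text lines.

    Hypothetical helper: only one dense row is held in memory at a time.
    """
    data, rows, cols = [], [], []
    n_rows = 0
    n_cols = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        values = np.array([float(v) for v in line.split(delimiter)])
        nz = np.flatnonzero(values)  # column indices of nonzero entries
        data.append(values[nz])
        rows.append(np.full(nz.size, n_rows, dtype=np.intp))
        cols.append(nz)
        n_cols = max(n_cols, values.size)
        n_rows += 1
    if n_rows == 0:
        raise ValueError("no data rows found")
    return sparse.coo_matrix(
        (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
        shape=(n_rows, n_cols),
    )
```

With a real file this would be called as dense_text_to_coo(f) inside a with open(...) as f: block, so the dense matrix never exists in memory as a whole.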

Here is a line-by-line approach, equivalent to using loadtxt on blocks of 1 line:

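A test list of text lines, equivalent to a file (the session assumes import numpy as np and from scipy import sparse):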

In [840]: txt=b"""1,0,0,2,3
0,0,0,0,0
4,0,0,0,0
0,0,0,3,0
""".splitlines()
In [841]: 
In [841]: np.loadtxt(txt,delimiter=',',dtype=int)
Out[841]: 
array([[1, 0, 0, 2, 3],
       [0, 0, 0, 0, 0],
       [4, 0, 0, 0, 0],
       [0, 0, 0, 3, 0]])

Process it line by line:

In [842]: ll=[]
In [843]: for line in txt:
    ll.append(np.loadtxt([line],delimiter=','))
   .....:     
In [844]: ll
Out[844]: 
[array([ 1.,  0.,  0.,  2.,  3.]),
 array([ 0.,  0.,  0.,  0.,  0.]),
 array([ 4.,  0.,  0.,  0.,  0.]),
 array([ 0.,  0.,  0.,  3.,  0.])]

Now convert each array to a coo matrix:

In [845]: lc=[[sparse.coo_matrix(l)] for l in ll]
In [846]: lc
Out[846]: 
[[<1x5 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in COOrdinate format>],
 [<1x5 sparse matrix of type '<class 'numpy.float64'>'
    with 0 stored elements in COOrdinate format>],
 [<1x5 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>],
 [<1x5 sparse matrix of type '<class 'numpy.float64'>'
    with 1 stored elements in COOrdinate format>]]

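Then assemble that list with bmat, which builds a single sparse matrix from a grid of sparse blocks (the result below is in coo format):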

In [847]: B=sparse.bmat(lc)
In [848]: B
Out[848]: 
<4x5 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in COOrdinate format>
In [849]: B.toarray()
Out[849]: 
array([[ 1.,  0.,  0.,  2.,  3.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 4.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  3.,  0.]])

sparse.coo_matrix(l) is a simple way to compress each row into a bmat-compatible object.

To process the text two lines at a time:

In [874]: ld=[]
In [875]: for i in range(0,4,2):
    arr = np.loadtxt(txt[i:i+2],delimiter=',')
    ld.append([sparse.coo_matrix(arr)])
   .....:     
In [876]: ld
Out[876]: 
[[<2x5 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in COOrdinate format>],
 [<2x5 sparse matrix of type '<class 'numpy.float64'>'
    with 2 stored elements in COOrdinate format>]]

Feed this to sparse.bmat as before.
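Putting the pieces together for a real file, a chunked reader might look like the sketch below. It is one possible arrangement, not a definitive implementation: the function name load_sparse_in_chunks, the chunk_rows parameter and the file name in the usage comment are all hypothetical, and the tab delimiter is taken from the question.

```python
import itertools

import numpy as np
from scipy import sparse


def load_sparse_in_chunks(lines, chunk_rows=1000, delimiter="\t"):
    """Read dense text rows in chunks and combine them into one csr matrix.

    Hypothetical helper: `lines` is any iterable of text lines, such as
    an open file. Only `chunk_rows` dense rows exist in memory at once.
    """
    it = iter(lines)
    blocks = []
    while True:
        raw = list(itertools.islice(it, chunk_rows))
        if not raw:
            break  # input exhausted
        chunk = [ln for ln in raw if ln.strip()]  # drop blank lines
        if not chunk:
            continue
        # ndmin=2 keeps a single-row chunk two-dimensional.
        arr = np.loadtxt(chunk, delimiter=delimiter, ndmin=2)
        blocks.append([sparse.coo_matrix(arr)])
    if not blocks:
        raise ValueError("no data rows found")
    return sparse.bmat(blocks, format="csr")


# Usage with a file (the path is a placeholder):
# with open("matrix.tsv") as f:
#     M = load_sparse_in_chunks(f, chunk_rows=5000)
```

A larger chunk_rows means fewer loadtxt calls but a bigger temporary dense array, so the value is a trade-off between speed and peak memory.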

Solution 1: 2, accepted, 2016-03-03 17:45:49
