
How to load a sparse matrix efficiently?

Given a file with this structure:

  • Single-column lines are keys
  • Two-column lines hold a second key and its non-zero value for the preceding key

For example:

abc
ef 0.85
kl 0.21
xyz 0.923
cldex 
plax 0.123
lion -0.831

How can I create a sparse matrix (e.g. a csr_matrix) with these entries?

('abc', 'ef') 0.85
('abc', 'kl') 0.21
('abc', 'xyz') 0.923
('cldex', 'plax') 0.123
('cldex', 'lion') -0.831

I've tried:

from collections import defaultdict

x = """abc
ef  0.85
kl  0.21
xyz 0.923
cldex 
plax    0.123
lion    -0.831""".split('\n')

k1 = ''
arr = defaultdict(dict)
for line in x:
    line = line.split()  # split on any whitespace (tabs or spaces)
    if len(line) == 1:
        k1 = line[0]
    else:
        k2, v = line
        v = float(v)
        arr[k1][k2] = v

[out]

>>> arr
defaultdict(dict,
            {'abc': {'ef': 0.85, 'kl': 0.21, 'xyz': 0.923},
             'cldex': {'plax': 0.123, 'lion': -0.831}})

The nested dict structure isn't as convenient as a scipy sparse matrix.

Is there an easy way to read a file in the format above into any of the scipy sparse matrix objects?

Converting @hpaulj's comment into an answer: you can iteratively append to lists of row labels, column labels and values. Afterwards, factorize the labels into integer indices using pd.factorize, np.unique, or sklearn's LabelEncoder, and build a sparse coo_matrix from them.

from scipy import sparse
import numpy as np
import pandas as pd

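# x is the raw text as a single string (not the pre-split list from the question)
rows, cols, values = [], [], []
for line in x.splitlines():
    fields = line.split()
    if len(fields) == 1:
        ridx = fields[0]              # single-column line: new row key
    elif len(fields) == 2:
        cidx, value = fields
        rows.append(ridx)
        cols.append(cidx)
        values.append(float(value))

# factorize returns (integer codes, unique labels)
rows, rinv = pd.factorize(np.array(rows))
cols, cinv = pd.factorize(np.array(cols))

sp = sparse.coo_matrix((values, (rows, cols)), dtype=np.float32)
# sp = sparse.csr_matrix((np.array(values, dtype=float), (rows, cols)))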

sp.toarray()
array([[ 0.85 ,  0.21 ,  0.923,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.123, -0.831]], dtype=float32)
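For comparison, here is a minimal sketch of the np.unique alternative mentioned above. The names row_labels, col_labels, row_names and col_names are illustrative, and the label lists are written out by hand rather than parsed. Unlike pd.factorize, np.unique sorts the labels, so the column order differs from the output shown above.

```python
import numpy as np
from scipy import sparse

# Label lists as they would come out of the parsing loop (hand-written here).
row_labels = ['abc', 'abc', 'abc', 'cldex', 'cldex']
col_labels = ['ef', 'kl', 'xyz', 'plax', 'lion']
values = [0.85, 0.21, 0.923, 0.123, -0.831]

# np.unique returns the sorted unique labels; return_inverse adds, for each
# input label, its position in that sorted array (i.e. its integer code).
row_names, row_codes = np.unique(row_labels, return_inverse=True)
col_names, col_codes = np.unique(col_labels, return_inverse=True)

mat = sparse.csr_matrix(
    (values, (row_codes, col_codes)),
    shape=(len(row_names), len(col_names)),
)
print(col_names)
print(mat.toarray())
```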

If required, you can use rinv and cinv to perform the inverse mapping (convert indices back to strings).
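A small sketch of that inverse mapping, under the assumption that the label lists have already been parsed (they are hard-coded here). The names row_codes, col_codes, triples, row_index and col_index are illustrative and not part of the original answer.

```python
import pandas as pd
from scipy import sparse

rows = ['abc', 'abc', 'abc', 'cldex', 'cldex']
cols = ['ef', 'kl', 'xyz', 'plax', 'lion']
values = [0.85, 0.21, 0.923, 0.123, -0.831]

# factorize returns (integer codes, unique labels in order of appearance)
row_codes, rinv = pd.factorize(pd.Series(rows))  # rinv[i] is the label of row i
col_codes, cinv = pd.factorize(pd.Series(cols))  # cinv[j] is the label of column j
sp = sparse.coo_matrix((values, (row_codes, col_codes)))

# Indices -> strings: look up each stored entry's row and column label.
triples = [(rinv[i], cinv[j], v) for i, j, v in zip(sp.row, sp.col, sp.data)]

# Strings -> indices: invert the label arrays into dictionaries.
row_index = {label: i for i, label in enumerate(rinv)}
col_index = {label: j for j, label in enumerate(cinv)}
value = sp.tocsr()[row_index['cldex'], col_index['lion']]
print(triples[0], value)
```

Building the two dictionaries once makes label-based lookups cheap when many entries have to be read back.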

As of version 0.23, pandas has sparse versions of Series and DataFrame. Your data can be seen as a Series with a multi-level index, so you can exploit this to build the sparse matrix. In addition, if the format is consistent, it can be read with a few lines of pandas, for example:

import numpy as np
import pandas as pd
from io import StringIO

lines = StringIO("""abc
ef  0.85
kl  0.21
xyz 0.923
cldex
plax    0.123
lion    -0.831""")

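# load the text as two columns; key lines get NaN in column v
s = pd.read_csv(lines, sep=r'\s+', header=None, names=['k', 'v'])
# k2 carries each key forward onto the value lines below it
s = s.assign(k2=pd.Series(np.where(np.isnan(s.v), s.k, np.nan)).ffill())
result = s[~np.isnan(s.v)].set_index(['k2', 'k']).squeeze()

# convert to sparse matrix (csr)
# pandas >= 1.0 uses the .sparse accessor; on pandas < 0.25 the equivalent was
# ss = result.to_sparse() followed by ss.to_coo(...)
ss = result.astype(pd.SparseDtype(float))
coo, rows, columns = ss.sparse.to_coo(row_levels=['k'], column_levels=['k2'], sort_labels=True)
print(coo.tocsr())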

Output

  (0, 0)    0.85
  (1, 0)    0.21
  (2, 1)    -0.831
  (3, 1)    0.12300000000000001
  (4, 0)    0.9229999999999999

The to_coo method returns not only the matrix but also the row and column labels, so it provides the inverse mapping as well. In the example above, the labels are:

['ef', 'kl', 'lion', 'plax', 'xyz']
['abc', 'cldex']

Here 'ef' corresponds to row index 0 and 'abc' corresponds to column index 0.
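To address entries by label rather than by position, one option is to wrap the matrix in a sparse DataFrame. The sketch below assumes a pandas version that provides pd.DataFrame.sparse.from_spmatrix (0.25 or later) and copies the labels and entries by hand from the output above instead of recomputing them.

```python
import pandas as pd
from scipy import sparse

# Labels and entries copied from the output above.
rows = ['ef', 'kl', 'lion', 'plax', 'xyz']
columns = ['abc', 'cldex']
coo = sparse.coo_matrix(
    ([0.85, 0.21, -0.831, 0.123, 0.923],
     ([0, 1, 2, 3, 4], [0, 0, 1, 1, 0])),
    shape=(len(rows), len(columns)),
)

# Attach the labels so entries can be looked up by name.
df = pd.DataFrame.sparse.from_spmatrix(coo, index=rows, columns=columns)
value = df.loc['lion', 'cldex']
print(value)
```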

Given that you have the dict

dox = {'abc': {'ef': 0.85, 'kl': 0.21, 'xyz': 0.923},'cldex': {'plax': 0.123, 'lion': -0.831}}

this should help you convert it to a sparse matrix:

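from scipy.sparse import csr_matrix

indptr = [0]
indices = []
data = []
vocabulary = {}

for d in dox:
    for term in dox[d]:
        # give each new column key the next free column index
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(dox[d][term])
    indptr.append(len(indices))  # close the row once all its terms are added

mat = csr_matrix((data, indices, indptr), dtype=float)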

This follows the incremental matrix construction example in scipy's csr_matrix documentation. Here is the output:

mat.todense()

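The same incremental idea can be applied directly while reading the lines, which skips the intermediate nested dict. This is only a sketch: it assumes the input starts with a key line and that every value line has exactly two whitespace-separated fields. The names text and row_labels are illustrative.

```python
from scipy.sparse import csr_matrix

text = """abc
ef 0.85
kl 0.21
xyz 0.923
cldex
plax 0.123
lion -0.831"""

indptr, indices, data = [], [], []
row_labels, vocabulary = [], {}

for line in text.splitlines():
    fields = line.split()
    if not fields:
        continue  # skip blank lines
    if len(fields) == 1:
        # key line: a new row starts at the current end of the data
        row_labels.append(fields[0])
        indptr.append(len(indices))
    else:
        term, value = fields
        # give each new column key the next free column index
        indices.append(vocabulary.setdefault(term, len(vocabulary)))
        data.append(float(value))
indptr.append(len(indices))  # close the last row

mat = csr_matrix(
    (data, indices, indptr),
    shape=(len(row_labels), len(vocabulary)),
    dtype=float,
)
print(mat.toarray())
```

Here row_labels[i] names row i, and vocabulary maps each column key to its column index, so both directions of the label mapping stay available.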
