How to load a sparse matrix efficiently?
Given a file with a structure like this, for example:
abc
ef 0.85
kl 0.21
xyz 0.923
cldex
plax 0.123
lion -0.831
How can I create a sparse matrix (csr_matrix) like the following?
('abc', 'ef') 0.85
('abc', 'kl') 0.21
('abc', 'xyz') 0.923
('cldex', 'plax') 0.123
('cldex', 'lion') -0.831
I've tried:
from collections import defaultdict
x = """abc
ef 0.85
kl 0.21
xyz 0.923
cldex
plax 0.123
lion -0.831""".split('\n')
k1 = ''
arr = defaultdict(dict)
for line in x:
    # split on any whitespace, so tab- and space-separated lines both work
    line = line.split()
    if len(line) == 1:
        # a single token starts a new outer key
        k1 = line[0]
    else:
        k2, v = line
        v = float(v)
        arr[k1][k2] = v
[out]
>>> arr
defaultdict(dict,
            {'abc': {'ef': 0.85, 'kl': 0.21, 'xyz': 0.923},
             'cldex': {'plax': 0.123, 'lion': -0.831}})
Having the nested dict structure isn't as convenient as the scipy sparse matrix structure.
Is there a way to easily read a file in the format above into any of the scipy sparse matrix objects?
Converting @hpaulj's comment into an answer: you can iteratively append to lists of row and column labels. Afterwards, factorise these using pd.factorize, np.unique, or sklearn's LabelEncoder, and convert to a sparse coo_matrix.
from scipy import sparse
import numpy as np
import pandas as pd
# x is assumed to hold the raw file content as one string (before any split)
rows, cols, values = [], [], []
for line in x.splitlines():
    parts = line.split()
    if len(parts) == 1:
        # a line with a single token names the current row
        ridx = parts[0]
    elif len(parts) == 2:
        cidx, value = parts
        rows.append(ridx)
        cols.append(cidx)
        values.append(float(value))
# map each distinct label to an integer code; rinv/cinv hold the unique labels
rows, rinv = pd.factorize(np.array(rows, dtype=object))
cols, cinv = pd.factorize(np.array(cols, dtype=object))
sp = sparse.coo_matrix((values, (rows, cols)), dtype=np.float32)
# sp = sparse.csr_matrix((np.array(values, dtype=float), (rows, cols)))
sp.toarray()
array([[ 0.85 ,  0.21 ,  0.923,  0.   ,  0.   ],
       [ 0.   ,  0.   ,  0.   ,  0.123, -0.831]], dtype=float32)
If required, you can use rinv and cinv to perform an inverse mapping (convert indices back to strings).
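As a minimal sketch of that inverse mapping, the label arrays returned by pd.factorize can be indexed with the coordinates stored in the COO matrix. The triplet lists below are typed in by hand to keep the example self-contained, and the names row_codes, col_codes and pairs are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Triplets as they would be collected while parsing the file.
rows = ["abc", "abc", "abc", "cldex", "cldex"]
cols = ["ef", "kl", "xyz", "plax", "lion"]
values = [0.85, 0.21, 0.923, 0.123, -0.831]

# factorize returns integer codes plus the array of unique labels.
row_codes, rinv = pd.factorize(np.array(rows, dtype=object))
col_codes, cinv = pd.factorize(np.array(cols, dtype=object))
sp = sparse.coo_matrix((values, (row_codes, col_codes)), dtype=np.float64)

# Index the label arrays with the matrix coordinates to recover the strings.
pairs = {
    (str(rinv[i]), str(cinv[j])): float(v)
    for i, j, v in zip(sp.row, sp.col, sp.data)
}
for key, v in pairs.items():
    print(key, v)
```

Each printed line pairs a (row label, column label) tuple with its value, which is the form asked for in the question.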
As of version 0.23, pandas provides sparse versions of Series and DataFrame. Coincidentally, your data can be seen as a Series with a multi-level index, so you can exploit this fact to build the sparse matrix. In addition, if the format is consistent, it can be read with a few lines of pandas, for example:
import numpy as np
import pandas as pd
from io import StringIO
lines = StringIO("""abc
ef 0.85
kl 0.21
xyz 0.923
cldex
plax 0.123
lion -0.831""")
# load as a two-column frame; lines holding only a key get v = NaN
s = pd.read_csv(lines, sep=r'\s+', header=None, names=['k', 'v'])
# the outer key is the k value on NaN rows, forward-filled onto the rows below
s = s.assign(k2=pd.Series(np.where(np.isnan(s.v), s.k, np.nan)).ffill())
# keep the value rows and index them by (outer key, inner key)
result = s[~np.isnan(s.v)].set_index(['k2', 'k']).squeeze()
# convert to sparse matrix (csr)
# note: Series.to_sparse() is the pandas 0.23 API; it was removed in pandas 1.0
ss = result.to_sparse()
coo, rows, columns = ss.to_coo(row_levels=['k'], column_levels=['k2'], sort_labels=True)
print(coo.tocsr())
Output
(0, 0) 0.85
(1, 0) 0.21
(2, 1) -0.831
(3, 1) 0.12300000000000001
(4, 0) 0.9229999999999999
The to_coo method returns not only the matrix but also the row and column labels, so it provides the inverse mapping as well. In the example above it returns the following:
['ef', 'kl', 'lion', 'plax', 'xyz']
['abc', 'cldex']
Here 'ef' corresponds to index 0 of the rows and 'abc' corresponds to index 0 of the columns.
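On pandas 1.0 and later, where Series.to_sparse() is no longer available, roughly the same steps can be sketched with a sparse dtype and the Series.sparse accessor. This is an adaptation under that assumption, not the code from the answer above; the names text, df, row_labels and col_labels are illustrative.

```python
import io

import numpy as np
import pandas as pd

text = """abc
ef 0.85
kl 0.21
xyz 0.923
cldex
plax 0.123
lion -0.831"""

# Lines holding only a key are read with v = NaN.
df = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None, names=["k", "v"])

# The outer key is k on the NaN rows, forward-filled onto the value rows.
df["k2"] = df["k"].where(df["v"].isna()).ffill()

# Keep only the value rows, indexed by (outer key, inner key).
result = df[df["v"].notna()].set_index(["k2", "k"])["v"]

# Cast to a sparse dtype, then use the .sparse accessor to build the COO matrix.
ss = result.astype(pd.SparseDtype("float64", np.nan))
coo, row_labels, col_labels = ss.sparse.to_coo(
    row_levels=["k"], column_levels=["k2"], sort_labels=True
)
csr = coo.tocsr()
print(csr.shape)
```

As in the original call, the inner keys are used as rows and the outer keys as columns; swapping row_levels and column_levels gives the transposed layout.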
Given that you have the dict
dox = {'abc': {'ef': 0.85, 'kl': 0.21, 'xyz': 0.923},'cldex': {'plax': 0.123, 'lion': -0.831}}
this should help you convert it to a sparse matrix:
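from scipy.sparse import csr_matrix
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in dox:
    for term in dox[d]:
        # assign each new inner key the next free column index
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(dox[d][term])
    # close the current row: indptr marks where each row ends in indices/data
    indptr.append(len(indices))
mat = csr_matrix((data, indices, indptr), dtype=float)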
This utilizes scipy's example for an incremental matrix build. Here is the output:
mat.todense()
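The incremental build keeps the column mapping in vocabulary but does not record which outer key became which row. A minimal sketch of one way to keep both mappings, so that entries can be looked up by their string keys, is shown here; row_labels, row_index and lookup are hypothetical names added for illustration.

```python
from scipy.sparse import csr_matrix

dox = {
    "abc": {"ef": 0.85, "kl": 0.21, "xyz": 0.923},
    "cldex": {"plax": 0.123, "lion": -0.831},
}

indptr, indices, data = [0], [], []
vocabulary = {}   # inner key -> column index
row_labels = []   # row index -> outer key (hypothetical helper)

for key, inner in dox.items():
    row_labels.append(key)
    for term, value in inner.items():
        indices.append(vocabulary.setdefault(term, len(vocabulary)))
        data.append(value)
    indptr.append(len(indices))

mat = csr_matrix(
    (data, indices, indptr),
    shape=(len(row_labels), len(vocabulary)),
    dtype=float,
)

# Reverse lookup tables for both axes.
row_index = {label: i for i, label in enumerate(row_labels)}


def lookup(row_key, col_key):
    """Return the stored value for a pair of string keys (0.0 if absent)."""
    return float(mat[row_index[row_key], vocabulary[col_key]])


print(lookup("cldex", "lion"))
```

Passing shape explicitly is a design choice here: it keeps the matrix dimensions tied to the two label tables rather than inferred from the largest column index seen.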