A scalable way of representing and accessing sparse data in Python
I have a sparse binary matrix represented with a file like this:
p_1|m_11
p_1|m_12
p_1|m_13
...
p_1|m_1,N1
p_2|m_21
p_2|m_22
...
p_2|m_2,N2
...
p_K|m_K1
...
p_K|m_K,NK
The p's and m's come from two respective sets. If there are K unique p's and L unique m's, the above represents a sparse K×L matrix, with each line of the file corresponding to a single 1 element of the matrix.
The p's are integers; the m's are alphanumeric strings.
I need to have fast access to both individual elements of the matrix and its rows and columns. The current implementation shown below worked fine for small values of K (L is always about 50,000) but does not scale.
from scipy import sparse
import numpy as np

# 1st pass: collect unique ps and ms
Ps = set()
Ms = set()
nnz = 0
with open('data.txt', 'r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        Ps.add(parts[0])
        Ms.add(parts[1])
        nnz += 1

# sorted() returns a new list (list.sort() sorts in place and returns None)
Ps = sorted(Ps)  # optional but prefer sorted
Ms = sorted(Ms)  # optional but prefer sorted
K = len(Ps)
L = len(Ms)

# 2nd pass: create sparse mx
I = np.zeros(nnz, dtype=np.intp)  # integer dtype, since these hold row indices
J = np.zeros(nnz, dtype=np.intp)  # integer dtype, since these hold column indices
V = np.ones(nnz)
ix = 0
with open('data.txt', 'r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        I[ix] = Ps.index(parts[0])  # TAKES TOO LONG FOR LARGE K
        J[ix] = Ms.index(parts[1])
        ix += 1

data = sparse.coo_matrix((V, (I, J)), shape=(K, L)).tocsr()
There has got to be a different way of doing this that scales better, but what is it?
I am not married to the sparse matrix format (a dict?). I am willing to use any data structure that allows me fast access to individual elements, "rows" and "columns".
CLARIFICATION (I hope):
I am trying to move away from retrieving elements, rows and columns of my data using integer row/column values that get extracted by searching through two long arrays of strings.
Instead I just want to use the actual p's and m's as keys, so instead of data[i,j] I want to use something like data[p_10,m_15]; and instead of data[i,:] use something like data[p_10,:].
I also need to be able to create data fast from my data file.
Again, data does not need to be a scipy or numpy sparse matrix.
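One possible shape for such a structure, sketched here only as an illustration: keep two dictionaries of sets, one keyed by p and one keyed by m, so that elements, rows and columns can all be looked up directly by label. The class name `LabeledBinaryMatrix` and its methods are hypothetical, and the sketch assumes every line has the form `p|m` with p parseable as an integer.

```python
from collections import defaultdict


class LabeledBinaryMatrix:
    """Sparse binary matrix addressed by p and m labels (hypothetical helper)."""

    def __init__(self):
        self.rows = defaultdict(set)  # p -> set of m's that have a 1 in row p
        self.cols = defaultdict(set)  # m -> set of p's that have a 1 in column m

    def add(self, p, m):
        """Record a 1 at position (p, m)."""
        self.rows[p].add(m)
        self.cols[m].add(p)

    def __getitem__(self, key):
        p, m = key
        if isinstance(p, slice) and isinstance(m, slice):
            raise KeyError("give at least one label")
        if isinstance(m, slice):  # data[p, :] -> labels of the 1s in row p
            return self.rows.get(p, frozenset())
        if isinstance(p, slice):  # data[:, m] -> labels of the 1s in column m
            return self.cols.get(m, frozenset())
        return 1 if m in self.rows.get(p, ()) else 0  # data[p, m]

    @classmethod
    def from_lines(cls, lines):
        """Build the structure in a single pass over 'p|m' lines."""
        data = cls()
        for line in lines:
            line = line.strip()
            if not line:
                continue
            p, m = line.split('|')
            data.add(int(p), m)  # assumption: p's are integers, as stated above
        return data
```

With this, `data = LabeledBinaryMatrix.from_lines(open('data.txt'))` reads the file once, and `data[10, 'm15']`, `data[10, :]` and `data[:, 'm15']` are each a dictionary lookup. The trade-off is memory: every nonzero is stored twice, and rows and columns come back as sets of labels rather than numeric vectors, so this may not suit code that needs linear algebra on the matrix.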
I was able to speed up the 2nd pass below by simply creating two inverse indices:
from scipy import sparse
import numpy as np

# 1st pass: collect unique ps and ms
Ps = set()
Ms = set()
nnz = 0
with open('data.txt', 'r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        Ps.add(parts[0])
        Ms.add(parts[1])
        nnz += 1

# sorted() returns a new list (list.sort() sorts in place and returns None)
Ps = sorted(Ps)  # optional but prefer sorted
Ms = sorted(Ms)  # optional but prefer sorted
K = len(Ps)
L = len(Ms)

# create inverse indices for quick lookup: label -> position in the sorted list
mapPs = {p: i for i, p in enumerate(Ps)}
mapMs = {m: i for i, m in enumerate(Ms)}

# 2nd pass: create sparse mx
I = np.zeros(nnz, dtype=np.intp)  # integer dtype, since these hold row indices
J = np.zeros(nnz, dtype=np.intp)  # integer dtype, since these hold column indices
V = np.ones(nnz)
ix = 0
with open('data.txt', 'r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        #I[ix] = Ps.index(parts[0])  # TAKES TOO LONG FOR LARGE K
        #J[ix] = Ms.index(parts[1])  # TAKES TOO LONG FOR LARGE K
        I[ix] = mapPs[parts[0]]  # dict lookup instead of a linear list search
        J[ix] = mapMs[parts[1]]
        ix += 1

data = sparse.coo_matrix((V, (I, J)), shape=(K, L)).tocsr()
I have not had a chance to test it on a much larger dataset, but on a smaller one that I had problems with, execution time went from about 1 hour to about 10 seconds! So I am satisfied with this solution for now.
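If reading the file twice ever becomes the bottleneck, the same inverse-index idea might be folded into a single pass by assigning each new label the next free index as it is first seen. This is only a sketch: the function names `build_coo_inputs` and `to_csr` are made up for illustration, and indices come out in first-seen order rather than sorted order, which gives up the "prefer sorted" property.

```python
def build_coo_inputs(lines):
    """Single pass over 'p|m' lines: assign indices on first sight of a label."""
    p_index = {}  # p label -> row number
    m_index = {}  # m label -> column number
    rows, cols = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        p, m = line.split('|')
        # setdefault returns the existing index, or stores and returns the next free one
        rows.append(p_index.setdefault(p, len(p_index)))
        cols.append(m_index.setdefault(m, len(m_index)))
    return rows, cols, p_index, m_index


def to_csr(rows, cols, p_index, m_index):
    """Turn the index lists into a CSR matrix (needs numpy and scipy)."""
    import numpy as np
    from scipy import sparse

    values = np.ones(len(rows), dtype=np.int8)
    shape = (len(p_index), len(m_index))
    # Note: repeated (p, m) lines are summed by the COO -> CSR conversion
    return sparse.coo_matrix((values, (rows, cols)), shape=shape).tocsr()
```

The two dictionaries then double as the label lookup for access: `data[p_index[p], m_index[m]]` for one element and `data[p_index[p], :]` for a row. For column access, converting once with `data.tocsc()` is probably worthwhile, since CSR is laid out for row slicing.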