A scalable way to represent and access sparse data in Python

Question

I have a sparse binary matrix represented by a file like this:

p_1|m_11
p_1|m_12
p_1|m_13
...
p_1|m_1,N1
p_2|m_21
p_2|m_22
...
p_2|m_2,N2
...
p_K|m_K1
...
p_K|m_K,NK

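The p's and m's come from two separate sets. If there are K unique p's and L unique m's, the file above represents a sparse K x L matrix, with each line corresponding to a single 1 element of the matrix.

The p's are integers; the m's are alphanumeric strings.

I need fast access to individual elements of the matrix, as well as to its rows and columns. The current implementation shown below works fine for small values of K (L is always about 50,000), but it does not scale.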

from scipy import sparse
from numpy import array
import numpy as np

# 1st pass: collect unique ps and ms
Ps = set()
Ms = set()
nnz = 0
with open('data.txt','r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        Ps.add(parts[0])
        Ms.add(parts[1])
        nnz += 1

Ps = sorted(Ps)    # optional but prefer sorted; list.sort() returns None, so use sorted()
Ms = sorted(Ms)    # optional but prefer sorted
K = len(Ps)
L = len(Ms)

# 2nd pass: create sparse mx
I = np.zeros(nnz, dtype=int)    # row indices must be integers
J = np.zeros(nnz, dtype=int)    # column indices must be integers
V = np.ones(nnz)

ix = 0
with open('data.txt','r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        I[ix] = Ps.index(parts[0])  # TAKES TOO LONG FOR LARGE K
        J[ix] = Ms.index(parts[1])
        ix += 1

data = sparse.coo_matrix((V,(I,J)),shape=(K,L)).tocsr()

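Is there a different approach that scales better, and if so, what is it?

I am not wedded to the sparse matrix format (a dict, perhaps?). I am willing to use any data structure that gives me fast access to individual elements, "rows", and "columns".

Clarification (I hope):
I am trying to get away from retrieving elements, rows, and columns of my data via integer row/column values that are obtained by searching through two long arrays of strings.

Instead, I just want to use the actual p's and m's as keys, so instead of data[i,j] I would write something like data[p_10,m_15], and instead of data[i,:] something like data[p_10,:].

I also need to be able to create data quickly from my data file.

Again, data does not need to be a scipy or numpy sparse matrix.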

Answer 1

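I was able to speed up the second pass by simply creating two inverse indices, so that each linear list.index() search is replaced by a dict lookup: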

from scipy import sparse
from numpy import array
import numpy as np

# 1st pass: collect unique ps and ms
Ps = set()
Ms = set()
nnz = 0
with open('data.txt','r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        Ps.add(parts[0])
        Ms.add(parts[1])
        nnz += 1

Ps = sorted(Ps)    # optional but prefer sorted; list.sort() returns None, so use sorted()
Ms = sorted(Ms)    # optional but prefer sorted
K = len(Ps)
L = len(Ms)

# create inverse indices for quick lookup
#
mapPs = dict()
for i in range(len(Ps)):
    mapPs[Ps[i]] = i

mapMs = dict()
for i in range(len(Ms)):
    mapMs[Ms[i]] = i

# 2nd pass: create sparse mx
I = np.zeros(nnz, dtype=int)    # row indices must be integers
J = np.zeros(nnz, dtype=int)    # column indices must be integers
V = np.ones(nnz)

ix = 0
with open('data.txt','r') as fin:
    for line in fin:
        parts = line.strip().split('|')
        #I[ix] = Ps.index(parts[0]) # TAKES TOO LONG FOR LARGE K
        #J[ix] = Ms.index(parts[1]) # TAKES TOO LONG FOR LARGE K
        I[ix] = mapPs[parts[0]]
        J[ix] = mapMs[parts[1]]
        ix += 1

data = sparse.coo_matrix((V,(I,J)),shape=(K,L)).tocsr()

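I have not had a chance to test this on a much larger dataset, but on a smaller dataset that had been giving me trouble, execution time dropped from about 1 hour to about 10 seconds! So I am satisfied with this solution.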
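The same inverse indices can also serve the label-keyed access asked for in the question (data[p_10,m_15], data[p_10,:]). What follows is only a rough sketch under some assumptions: the names LabeledSparse, read_pairs, has, row, and col are made up for illustration and are not part of scipy; indices are assigned in order of first appearance rather than sorted, which lets the file be read in a single pass; and it relies on dicts preserving insertion order (Python 3.7+).

```python
import numpy as np
from scipy import sparse


def read_pairs(path):
    """Yield (p, m) label tuples from a 'p|m' file, skipping blank lines."""
    with open(path) as fin:
        for line in fin:
            line = line.strip()
            if line:
                p, m = line.split('|', 1)
                yield p, m


class LabeledSparse:
    """Hypothetical wrapper: query a binary sparse matrix by p/m labels."""

    def __init__(self, pairs):
        self.row_of = {}   # p label -> row index
        self.col_of = {}   # m label -> column index
        rows, cols = [], []
        for p, m in pairs:
            # setdefault hands out the next free index the first time a label is seen
            rows.append(self.row_of.setdefault(p, len(self.row_of)))
            cols.append(self.col_of.setdefault(m, len(self.col_of)))
        # Insertion order equals index order, so these lists map index -> label
        self.row_labels = list(self.row_of)
        self.col_labels = list(self.col_of)
        vals = np.ones(len(rows), dtype=np.int8)
        shape = (len(self.row_of), len(self.col_of))
        self.csr = sparse.coo_matrix((vals, (rows, cols)), shape=shape).tocsr()
        self.csr.data[:] = 1           # duplicate lines get summed; clamp back to binary
        self.csc = self.csr.tocsc()    # second copy for fast column access

    def has(self, p, m):
        """True if entry (p, m) is 1; unknown labels count as 0."""
        i, j = self.row_of.get(p), self.col_of.get(m)
        if i is None or j is None:
            return False
        return bool(self.csr[i, j])

    def row(self, p):
        """Return the m labels that are set in row p."""
        i = self.row_of[p]
        start, end = self.csr.indptr[i], self.csr.indptr[i + 1]
        return [self.col_labels[j] for j in self.csr.indices[start:end]]

    def col(self, m):
        """Return the p labels that are set in column m."""
        j = self.col_of[m]
        start, end = self.csc.indptr[j], self.csc.indptr[j + 1]
        return [self.row_labels[i] for i in self.csc.indices[start:end]]


# Tiny in-memory example; with a real file this would be LabeledSparse(read_pairs('data.txt'))
data = LabeledSparse([("1", "a"), ("1", "b"), ("2", "b")])
print(data.has("1", "b"), data.row("1"), data.col("b"))
```

Keeping both a CSR and a CSC copy doubles the memory used for the nonzeros, but it makes row and column lookups equally cheap. If only one direction is needed, the other copy can be dropped.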
