group by on scipy sparse matrix
I have a scipy sparse matrix with 10e6 rows and 10e3 columns, populated to 1%. I also have an array of size 10e6 which contains keys corresponding to the 10e6 rows of my sparse matrix.
I want to group my sparse matrix following these keys and aggregate with a sum function.
Example:
Keys:
['foo','bar','foo','baz','baz','bar']
Sparse matrix:
(0,1) 3 -> corresponds to the first 'foo' key
(0,10) 4 -> corresponds to the first 'bar' key
(2,1) 1 -> corresponds to the second 'foo' key
(1,3) 2 -> corresponds to the first 'baz' key
(2,3) 10 -> corresponds to the second 'baz' key
(2,4) 1 -> corresponds to the second 'bar' key
Expected result:
{
'foo': {1: 4}, -> 4 = 3 + 1
'bar': {4: 1, 10: 4},
'baz': {3: 12} -> 12 = 2 + 10
}
What is the most efficient way to do it?
I already tried to use pandas.SparseSeries.from_coo on my sparse matrix in order to be able to use pandas group by, but I get this known bug:
site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
863 for obj in objs:
864 if not isinstance(obj, NDFrame):
--> 865 raise TypeError("cannot concatenate a non-NDFrame object")
866
867 # consolidate
TypeError: cannot concatenate a non-NDFrame object
I can generate your target with basic dictionary and list operations:
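keys = ['foo','bar','foo','baz','baz','bar']
rows = [0,0,2,1,2,2]; cols=[1,10,1,3,3,4]; data=[3,4,1,2,10,1]
dd = {}
# keys[i] is matched with the i-th (row, col, data) entry
for i,k in enumerate(keys):
    d1 = dd.get(k, {})           # per-key dict of {column: running sum}
    v = d1.get(cols[i], 0)
    d1[cols[i]] = v + data[i]    # accumulate the value for this column
    dd[k] = d1
print(dd)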
producing (key order may vary with the Python version)
{'baz': {3: 12}, 'foo': {1: 4}, 'bar': {10: 4, 4: 1}}
I can generate a sparse matrix from this data as well with:
import numpy as np
from scipy import sparse
M = sparse.coo_matrix((data,(rows,cols)))
print(M)
print()
Md = M.todok()
print(Md)
But notice that the order of terms is not fixed. In the coo format the order is as entered, but change the format and the order changes. In other words, the match between keys and the elements of the sparse matrix is unspecified.
(0, 1) 3
(0, 10) 4
(2, 1) 1
(1, 3) 2
(2, 3) 10
(2, 4) 1
(0, 1) 3
(1, 3) 2
(2, 1) 1
(2, 3) 10
(0, 10) 4
(2, 4) 1
Until you clear up this mapping, the initial dictionary approach is best.
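If the mapping turns out to be one key per matrix row, as the prose of the question suggests, one possible vectorized approach is to multiply by a group indicator matrix. The following is only a sketch under that assumption: group_rows_by_key is a hypothetical helper name, and the toy data is rearranged so that each of the six keys labels its own row, which differs from the per-element layout used above.

```python
import numpy as np
from scipy import sparse


def group_rows_by_key(M, keys):
    """Sum the rows of sparse matrix M that share a key (hypothetical helper).

    Assumes keys[i] labels row i of M, so len(keys) == M.shape[0].
    """
    keys = np.asarray(keys)
    # uniques: sorted distinct keys; inverse[i]: group index of row i
    uniques, inverse = np.unique(keys, return_inverse=True)
    n_rows = M.shape[0]
    # G[g, i] == 1 when row i belongs to group g, so G.dot(M) sums rows per group
    G = sparse.csr_matrix(
        (np.ones(n_rows, dtype=M.dtype), (inverse, np.arange(n_rows))),
        shape=(len(uniques), n_rows),
    )
    return uniques, G.dot(M.tocsr())


# Toy data: one key per row (an assumption, not the layout of the example above)
keys = ['foo', 'bar', 'foo', 'baz', 'baz', 'bar']
rows = [0, 1, 2, 3, 4, 5]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]
M = sparse.coo_matrix((data, (rows, cols)), shape=(6, 11))

uniques, grouped = group_rows_by_key(M, keys)

# Convert each grouped CSR row into a plain {column: sum} dict for display
result = {}
for g, key in enumerate(uniques):
    start, stop = grouped.indptr[g], grouped.indptr[g + 1]
    result[str(key)] = {
        int(c): int(v)
        for c, v in zip(grouped.indices[start:stop], grouped.data[start:stop])
    }
print(result)
```

On this toy data the grouping matches the expected result in the question, with the groups listed in sorted key order because np.unique sorts. The grouped matrix stays sparse, so converting to nested dicts is only worthwhile for inspection; on a matrix of the size described in the question it is probably better to keep working with the sparse result.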