Group by on scipy sparse matrix

I have a scipy sparse matrix with 10e6 rows and 10e3 columns, populated to 1%. I also have an array of size 10e6 which contains the keys corresponding to the 10e6 rows of my sparse matrix. I want to group my sparse matrix by these keys and aggregate with a sum function.

Example:

Keys:
['foo','bar','foo','baz','baz','bar']

Sparse matrix:
(0,1) 3              -> corresponds to the first 'foo' key
(0,10) 4             -> corresponds to the first 'bar' key
(2,1) 1              -> corresponds to the second 'foo' key
(1,3) 2              -> corresponds to the first 'baz' key
(2,3) 10             -> corresponds to the second 'baz' key
(2,4) 1              -> corresponds to the second 'bar' key

Expected result:
{
    'foo': {1: 4},               -> 4 = 3 + 1
    'bar': {4: 1, 10: 4},        
    'baz': {3: 12}               -> 12 = 2 + 10
}

What is the most efficient way to do this?

I already tried to use pandas.SparseSeries.from_coo on my sparse matrix so that I could use the pandas group by, but I hit this known bug:

site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    863         for obj in objs:
    864             if not isinstance(obj, NDFrame):
--> 865                 raise TypeError("cannot concatenate a non-NDFrame object")
    866 
    867             # consolidate

TypeError: cannot concatenate a non-NDFrame object

I can generate your target with basic dictionary and list operations:

keys = ['foo','bar','foo','baz','baz','bar']
rows = [0,0,2,1,2,2]; cols=[1,10,1,3,3,4]; data=[3,4,1,2,10,1]
dd = {}
for i,k in enumerate(keys):
    d1 = dd.get(k, {})            # per-key dict: column -> running sum
    v = d1.get(cols[i], 0)        # current sum for this column, 0 if unseen
    d1[cols[i]] = v + data[i]
    dd[k] = d1
print(dd)

producing

{'foo': {1: 4}, 'bar': {10: 4, 4: 1}, 'baz': {3: 12}}
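For inputs of the size described in the question, a pure Python loop may be slow. Here is a minimal vectorized sketch of the same per-element aggregation. It assumes, as the loop does, that keys[i] labels the i-th stored value. Each key is mapped to an integer code with np.unique, the codes are used as the row indices of a COO matrix, and the conversion from COO to CSR sums entries that share a (row, column) pair. The names labels, codes, grouped and result are illustrative only.

```python
import numpy as np
from scipy import sparse

keys = np.array(['foo', 'bar', 'foo', 'baz', 'baz', 'bar'])
cols = np.array([1, 10, 1, 3, 3, 4])
data = np.array([3, 4, 1, 2, 10, 1])

# labels: sorted unique keys; codes[i]: index of keys[i] within labels
labels, codes = np.unique(keys, return_inverse=True)

# One row per group. Duplicate (group, column) entries are summed
# when the COO matrix is converted to CSR.
grouped = sparse.coo_matrix(
    (data, (codes, cols)),
    shape=(len(labels), int(cols.max()) + 1),
).tocsr()

# Optional: unpack the grouped matrix into the nested-dict form.
result = {}
G = grouped.tocoo()
for g, c, v in zip(G.row, G.col, G.data):
    result.setdefault(str(labels[g]), {})[int(c)] = int(v)
print(result)
```

Keeping the result as the sparse matrix grouped, with labels giving the row order, avoids building a large dictionary when the data is big.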

I can generate a sparse matrix from this data as well with:

import numpy as np
from scipy import sparse
M = sparse.coo_matrix((data,(rows,cols)))   # entries stored in the order given
print(M)
print()
Md = M.todok()                              # dictionary-of-keys format
print(Md)

But notice that the order of terms is not fixed. In the coo format the order is as entered, but change the format and the order changes. In other words, the match between keys and the elements of the sparse matrix is unspecified.

  (0, 1)    3
  (0, 10)   4
  (2, 1)    1
  (1, 3)    2
  (2, 3)    10
  (2, 4)    1

  (0, 1)    3
  (1, 3)    2
  (2, 1)    1
  (2, 3)    10
  (0, 10)   4
  (2, 4)    1
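The reordering can also be checked directly from the index arrays rather than from printed output. This is a small sketch using the same rows, cols and data as above. It compares the stored order of a COO matrix with the order obtained after a round trip through CSR, which is used here only as an example of a format change.

```python
import numpy as np
from scipy import sparse

rows = [0, 0, 2, 1, 2, 2]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]

M = sparse.coo_matrix((data, (rows, cols)))

# COO keeps the (row, col, value) triples in the order they were given.
coo_order = list(zip(M.row.tolist(), M.col.tolist(), M.data.tolist()))

# CSR stores entries row by row, so converting back yields a different order.
R = M.tocsr().tocoo()
csr_order = list(zip(R.row.tolist(), R.col.tolist(), R.data.tolist()))

print(coo_order)
print(csr_order)
```

A key array aligned with the first ordering would no longer line up with the second, even though both hold the same set of entries.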

Until you clear up this mapping, the initial dictionary approach is best.
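If the mapping turns out to be one key per matrix row, as the first paragraph of the question describes, one common approach is to multiply by a sparse indicator matrix. The sketch below rests on that assumption. The array row_keys is hypothetical (the example in the question has six keys but only three rows), and the names indicator, grouped and summed are illustrative only.

```python
import numpy as np
from scipy import sparse

rows = [0, 0, 2, 1, 2, 2]
cols = [1, 10, 1, 3, 3, 4]
data = [3, 4, 1, 2, 10, 1]
M = sparse.coo_matrix((data, (rows, cols))).tocsr()   # shape (3, 11)

# Hypothetical keys, one per matrix row (not taken from the question).
row_keys = np.array(['foo', 'bar', 'foo'])

# labels: sorted unique keys; codes[r]: group index of row r
labels, codes = np.unique(row_keys, return_inverse=True)
n_groups, n_rows = len(labels), M.shape[0]

# indicator[g, r] == 1 when row r belongs to group g
indicator = sparse.csr_matrix(
    (np.ones(n_rows, dtype=M.dtype), (codes, np.arange(n_rows))),
    shape=(n_groups, n_rows),
)

# Row g of the product is the sum of all rows of M in group g.
grouped = indicator @ M

# Dense view per group, for inspection on this small example only.
summed = {str(label): grouped[g].toarray().ravel()
          for g, label in enumerate(labels)}
print(summed)
```

The product stays sparse, so at the scale in the question only grouped and labels would need to be kept. The dense summed dictionary is there just to make the small example easy to read.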
