Pandas: Summing arrays as an aggregation with multiple groupby columns

I'm using Python 3.5.1 and Pandas 0.18.0.

Let's say I have a Pandas dataframe with multiple columns. The dataframe has one column that includes a numpy array. Here is an example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([{'A': 'Label1', 'B': 'yellow', 'C': np.array([0,1,0]), 'D': 1},
                       {'A': 'Label2', 'B': 'yellow', 'C': np.array([1,1,1]), 'D': 4},
                       {'A': 'Label1', 'B': 'yellow', 'C': np.array([1,0,1]), 'D': 2},
                       {'A': 'Label2', 'B': 'green', 'C': np.array([1,1,0]), 'D': 3}])
>>> df
        A       B          C  D
0  Label1  yellow  [0, 1, 0]  1
1  Label2  yellow  [1, 1, 1]  4
2  Label1  yellow  [1, 0, 1]  2
3  Label2   green  [1, 1, 0]  3

I want to create a dataframe that groups by columns A and B and aggregates columns C and D with a sum. Like this:

               C         D
A      B
Label1 yellow  [1, 1, 1] 3
Label2 green   [1, 1, 0] 3
       yellow  [1, 1, 1] 4

When I try to do the aggregation on the entire dataframe, column C (the one with the numpy arrays) is not returned:

>>> df.groupby(['A','B']).sum()
               D
A      B
Label1 yellow  3
Label2 green   3
       yellow  4

If I ignore column D and only attempt to output column C, I get an error:

>>> df[['A','B','C']].groupby(['A','B']).sum()
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 96, in f
    return self._cython_agg_general(alias, numeric_only=numeric_only)
  File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3038, in _cython_agg_general
    how, numeric_only=numeric_only)
  File "C:\Anaconda3\lib\site-packages\pandas\core\groupby.py", line 3084, in _cython_agg_blocks
    raise DataError('No numeric types to aggregate')
pandas.core.base.DataError: No numeric types to aggregate

If I group by only a single column and only output my array column, the arrays sum correctly:

>>> df[['A','C']].groupby(['A']).sum()
                C
A
Label1  [1, 1, 1]
Label2  [2, 2, 1]

But if I try to include the scalar column as an aggregate as well, my array column again is not returned:

>>> df[['A','C','D']].groupby(['A']).sum()
        D
A
Label1  3
Label2  7

Also, if I try to include column B (which contains strings) in the aggregate function, columns B and C are returned but column D is not:

>>> df[['A','B','C']].groupby(['A']).sum()
               B          C
A
Label1  yellowyellow  [1, 1, 1]
Label2   yellowgreen  [2, 2, 1]

Can anyone explain why this is happening? I know I could create an [A+B] column, group by that, sum my array column, and then merge the result back in with the rest of my data on column [A+B], but it seems like there should be a much simpler way. Any ideas?

A workaround is to aggregate each column in its own groupby and combine the results with pd.concat:

g = df.groupby(['A', 'B'])
pd.concat([g.C.apply(np.sum), g.D.sum()], axis=1)
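
If the apply-based version behaves differently on your pandas version, the same result can be built by looping over the groups explicitly. The following is only a sketch, written against a recent pandas rather than 0.18.0; the names rows and result are illustrative and do not come from the question, and the sample data assumes the first array is [0, 1, 0], as in the printed dataframe.

```python
import numpy as np
import pandas as pd

# Sample data from the question (first array taken as [0, 1, 0],
# matching the printed dataframe).
df = pd.DataFrame([
    {'A': 'Label1', 'B': 'yellow', 'C': np.array([0, 1, 0]), 'D': 1},
    {'A': 'Label2', 'B': 'yellow', 'C': np.array([1, 1, 1]), 'D': 4},
    {'A': 'Label1', 'B': 'yellow', 'C': np.array([1, 0, 1]), 'D': 2},
    {'A': 'Label2', 'B': 'green', 'C': np.array([1, 1, 0]), 'D': 3},
])

rows = []  # illustrative name: one dict per (A, B) group
for (a, b), grp in df.groupby(['A', 'B']):
    rows.append({
        'A': a,
        'B': b,
        # Stack the group's arrays into a 2-D array and sum down the rows,
        # which gives the element-wise sum of the arrays.
        'C': np.stack(list(grp['C'])).sum(axis=0),
        # The scalar column is summed the usual way.
        'D': grp['D'].sum(),
    })

result = pd.DataFrame(rows).set_index(['A', 'B'])
print(result)
```

This trades the one-liner for a plain loop, so it does not depend on how groupby handles object-dtype columns, but it will be slower on large dataframes.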
