Pandas GroupBy on subsets of the same DataFrame

Question

This question is an extension of a previous question of mine. I have a pandas DataFrame:

import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({ 'id' : range(1,N+1),
                    'weeks_elapsed' : [random.choice(range(1,25)) for i in range(1,N+1)],
                    'code' : [random.choice(codes) for i in range(1,N+1)],
                    'colour': [random.choice(colours) for i in range(1,N+1)],
                    'texture': [random.choice(textures) for i in range(1,N+1)],
                    'size': [random.randint(1,100) for i in range(1,N+1)],
                    'scaled_size': [random.randint(100,1000) for i in range(1,N+1)]
                   },  columns= ['id', 'weeks_elapsed', 'code','colour', 'texture', 'size', 'scaled_size'])

I group it by colour and code and compute some statistics on size and scaled_size, as follows:

grouped = df.groupby(['code', 'colour']).agg(
    {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
     'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
    }).reset_index()

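What I would like to do now is run the above calculation on df several times, over different weeks_elapsed intervals. Below is a brute-force solution; is there a more succinct and faster way to run it? Also, how can I concatenate the results for the different intervals into a single DataFrame?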

cut_offs = [4, 12]
grouped = {c: {} for c in cut_offs}
for c in cut_offs:
    # .ix has been removed from pandas; .loc performs the same boolean row selection
    grouped[c] = df.loc[df.weeks_elapsed <= c].groupby(['code', 'colour']).agg(
        {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
         'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
        }).reset_index()

I am particularly interested in np.average and np.size for the different weeks_elapsed intervals.

Answer 1

This is not a fully working answer, but perhaps it can be extended to eventually get you there.

filter = np.array([12, 4])  # assumes `import numpy as np`; largest cut-off first
for f in filter:
    df.loc[(df['weeks_elapsed'] <= f), 'filter'] = f

Now df looks like

>>> df.head()
Out[384]: 
   id  weeks_elapsed   code colour texture  size  adjusted_size  filter
0   1             20    one  white    soft    64            494     NaN
1   2              3  three  white    hard    22            650       4
2   3             22    two  black    hard    41            770     NaN
3   4              2    two  black    hard     4            325       4
4   5              4    two  black    hard    19            536       4

where filter contains the smallest group that the row belongs to. The next step would be

>>> df.groupby(['filter', 'code', 'colour']).agg({'size': [np.sum, np.average, np.size, pd.Series.idxmax],
                                    'adjusted_size': [np.sum, np.average, np.size, pd.Series.idxmax]}
).reset_index()
Out[387]: 
    filter   code colour  adjusted_size                            size  \
                                    sum     average  size  idxmax   sum   
0        4    one  black           2195  548.750000     4      45   142   
1        4    one  white            286  286.000000     1      81    58   
2        4  three  black            927  463.500000     2      99   121   
3        4  three  white           5850  585.000000    10      95   511   
4        4    two  black           1102  367.333333     3       4    94   
5        4    two  white            852  852.000000     1      75     2   
6       12    one  white           2499  499.800000     5      72   267   
7       12  three  black           4709  588.625000     8      84   431   
8       12  three  white            569  189.666667     3      97   171   
9       12    two  black           2446  611.500000     4      49   241   
10      12    two  white           2859  714.750000     4      43   203   


      average  size  idxmax  
0   35.500000     4       5  
1   58.000000     1      81  
2   60.500000     2      99  
3   51.100000    10      88  
4   31.333333     3      21  
5    2.000000     1      75  
6   53.400000     5      69  
7   53.875000     8      12  
8   57.000000     3      59  
9   60.250000     4      36  
10  50.750000     4      43

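However, these are not quite the groups you are looking for: observations with filter=4 end up only in the group for 4, and not in the group for filter=12.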

I tried looking at expanding_mean, but that only works row-wise. So this is incomplete for now, but maybe it helps someone else answer the question.
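One possible way around the limitation described above is to stack one filtered copy of the frame per cut-off, tag each copy with its cut-off, and run a single groupby over the stacked frame, so that a row can count towards several cut-offs. The following is only a sketch under that assumption: the helper name stats_by_cut_off and the cut_off column are made up for illustration, it uses a small hand-made frame rather than the random one, and it computes only the mean and the (non-null) count that the question asks about.

```python
import pandas as pd


def stats_by_cut_off(df, cut_offs, keys=('code', 'colour'),
                     values=('size', 'scaled_size')):
    """Hypothetical helper: mean and count per group for each cumulative cut-off."""
    # One copy of the qualifying rows per cut-off, so a row with
    # weeks_elapsed=3 appears under both cut_off=4 and cut_off=12.
    stacked = pd.concat(
        [df[df['weeks_elapsed'] <= c].assign(cut_off=c) for c in cut_offs],
        ignore_index=True,
    )
    # 'count' counts non-null values, which equals the group size when
    # the value columns contain no NaN.
    return (stacked
            .groupby(['cut_off', *keys])[list(values)]
            .agg(['mean', 'count']))


# Tiny deterministic frame (illustrative data, not from the question).
toy = pd.DataFrame({
    'weeks_elapsed': [3, 10, 20, 2],
    'code': ['one', 'one', 'one', 'two'],
    'colour': ['black', 'black', 'black', 'white'],
    'size': [10, 30, 50, 8],
    'scaled_size': [100, 300, 500, 80],
})
result = stats_by_cut_off(toy, [4, 12])
```

The result is a single frame indexed by (cut_off, code, colour), which also covers the "one DataFrame for all intervals" part of the question. The cost is that rows are duplicated once per cut-off they satisfy, which may matter for large frames with many cut-offs.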

Answer 2

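Okay, here is another option. From my research (I am just learning this myself), the only way to get the overlapping groups you want appears to be TimeGrouper. However, that requires your data to be indexed by time. One way to achieve this is as follows: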

filter = np.array([25, 12, 4]) # we need 25 here so we don't have NaN values later on
for i,f in enumerate(filter):
    df.loc[(df['weeks_elapsed'] <= f), 'filter'] = i + 1
df2 = df.set_index([pd.DatetimeIndex('2014-01-'+df['filter'].astype(int).astype(str))])
# Originally written with pd.TimeGrouper('D'), which has since been removed
# from pandas; pd.Grouper(freq='D') is its replacement.
results = df2.groupby(pd.Grouper(freq='D')).apply(lambda x: x.groupby(['code', 'colour']).agg(
    {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
     'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
    }).reset_index())

Now results contains everything, in an odd format. To change it:

results.set_index(results.index.get_level_values(0).day, drop=True, inplace=True)
results.set_index(filter[results.index.values - 1], drop=True)
Out[490]: 
     code colour  scaled_size                   scaled_size  size             \
                          sum     average  size      idxmax   sum    average   
25    one  black         4655  517.222222     9  2014-01-01   331  36.777778   
25    one  white         2444  305.500000     8  2014-01-01   292  36.500000   
25  three  black         2068  344.666667     6  2014-01-01   246  41.000000   
25  three  white         2859  571.800000     5  2014-01-01   260  52.000000   
25    two  black         6330  575.454545    11  2014-01-01   599  54.454545   
25    two  white         3200  533.333333     6  2014-01-01   291  48.500000   
12    one  black         4004  667.333333     6  2014-01-02   331  55.166667   
12    one  white         2965  741.250000     4  2014-01-02   130  32.500000   
12  three  black         3040  608.000000     5  2014-01-02   344  68.800000   
12  three  white         3795  474.375000     8  2014-01-02   359  44.875000   
12    two  black         2198  314.000000     7  2014-01-02   323  46.142857   
12    two  white         3427  571.166667     6  2014-01-02   271  45.166667   
4     one  black         1501  500.333333     3  2014-01-03    73  24.333333   
4     one  white         1710  570.000000     3  2014-01-03   210  70.000000   
4   three  black         1461  730.500000     2  2014-01-03    14   7.000000   
4   three  white          961  480.500000     2  2014-01-03    14   7.000000   
4     two  black         1656  552.000000     3  2014-01-03   189  63.000000   
4     two  white         2462  410.333333     6  2014-01-03   352  58.666667   

               size  
    size     idxmax  
25     9 2014-01-01  
25     8 2014-01-01  
25     6 2014-01-01  
25     5 2014-01-01  
25    11 2014-01-01  
25     6 2014-01-01  
12     6 2014-01-02  
12     4 2014-01-02  
12     5 2014-01-02  
12     8 2014-01-02  
12     7 2014-01-02  
12     6 2014-01-02  
4      3 2014-01-03  
4      3 2014-01-03  
4      2 2014-01-03  
4      2 2014-01-03  
4      3 2014-01-03  
4      6 2014-01-03
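Reading the labelling loop at the start of this answer, each row ends up with exactly one label: weeks_elapsed up to 4, above 4 and up to 12, or above 12 and up to 25. The group sizes in the output above are consistent with that, since they add up to the 100 rows of the frame. Assuming that reading is right, the same disjoint bands could be produced with pd.cut and grouped on directly, without a datetime index. This is a sketch only; the band column and the toy data are illustrative and not part of the answer.

```python
import pandas as pd

# Small deterministic frame (illustrative data, not from the question).
toy = pd.DataFrame({
    'weeks_elapsed': [3, 4, 10, 20, 12],
    'code': ['one', 'one', 'one', 'two', 'one'],
    'size': [10, 20, 30, 50, 40],
})

# Right-closed bins (0, 4], (4, 12], (12, 25], each labelled by its upper edge.
# The cast to int assumes every weeks_elapsed value falls inside the bins;
# values outside them would become NaN and make the cast fail.
toy['band'] = pd.cut(toy['weeks_elapsed'],
                     bins=[0, 4, 12, 25],
                     labels=[4, 12, 25]).astype(int)

# Mean and non-null count of size per (band, code).
banded = toy.groupby(['band', 'code'])['size'].agg(['mean', 'count'])
```

Because the bands do not overlap, a row with weeks_elapsed of 3 contributes to band 4 only, so these are per-interval statistics rather than cumulative "up to c" statistics.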

Answer 3

@FooBar's answer may be better (I haven't fully digested it yet), but here is another approach.

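First, write a function that returns a custom averaging function based on your filter criterion. The inner function takes only a Series; the outer function defines the value to filter on and the DataFrame that the Series comes from.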

In [248]: def filter_average(base_df, filter_value, filter_by='weeks_elapsed'):
     ...:     def inner(x):
     ...:         return np.average(x[base_df[filter_by] <= filter_value])
     ...:     inner.__name__ = 'avg<=' + str(filter_value)
     ...:     return inner

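Then, in the groupby operation, use a list comprehension to build one version of the filtered-average function per cut-off, as shown below. The __name__ line above is necessary so that the resulting column headers are distinct.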

In [249]: df.groupby(['code','colour']).agg({'size': [filter_average(df, i) 
                                                      for i in cut_offs]})
Out[249]: 
                   size           
                  avg<=4    avg<=12
code  colour                      
one   black   55.166667  56.555556
      white   81.750000  58.583333
three black         NaN  32.000000
      white   40.333333  36.400000
two   black   32.000000  37.714286
      white   95.000000  45.000000

The same approach would work with np.size, and could even be built into a more general decorator.
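As a rough illustration of that generalisation, the factory can take the reducing function as an argument instead of hard-coding np.average. This is a sketch, not the answer's own code: the name filtered and the toy frame are hypothetical, and the filter column is looked up by the group's own index so that the boolean mask lines up with the group.

```python
import numpy as np
import pandas as pd


def filtered(func, base_df, filter_value, filter_by='weeks_elapsed'):
    """Hypothetical generalisation of filter_average: wrap any reducing function."""
    def inner(x):
        # x is one group's Series; fetch the filter column for just those rows.
        keep = base_df.loc[x.index, filter_by] <= filter_value
        return func(x[keep])
    # agg needs distinct function names to build distinct column headers.
    inner.__name__ = f'{func.__name__}<={filter_value}'
    return inner


# Small deterministic frame (illustrative data, not from the question).
toy = pd.DataFrame({
    'weeks_elapsed': [3, 10, 20, 2],
    'code': ['one', 'one', 'one', 'two'],
    'size': [10, 30, 50, 8],
})
cut_offs = [4, 12]

# One wrapped function per (statistic, cut-off) pair.
funcs = [filtered(f, toy, c) for f in (np.mean, np.size) for c in cut_offs]
out = toy.groupby('code')['size'].agg(funcs)
```

Here out has the columns mean<=4, mean<=12, size<=4 and size<=12. A group with no rows under a cut-off would get NaN for the mean, as in the three/black row of the output above, and 0 for the size.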

Answer 1
1 2014-06-18 13:48:47

Answer 2
1 2014-06-18 14:29:49

Answer 3
1 2014-06-18 14:58:58
