
Aggregating rows in a Python pandas dataframe

I have a dataframe documenting when a product was added to and removed from a basket. However, the set_name column mixes two kinds of information: the color set and the shape set. See below:

   eff_date  prod_id   set_name         change_type           
0  20150414  20770     MONO COLOR SET   ADD             
1  20150414  20770     REC SHAPE SET    ADD         
2  20150429  132       MONO COLOR SET   ADD                
3  20150429  132       REC SHAPE SET    ADD        
4  20150521  199       MONO COLOR SET   DROP
5  20150521  199       REC SHAPE SET    DROP
6  20150521  199       TET SHAPE SET    ADD
7  20150521  199       MONO COLOR SET   ADD

I would like to split the two kinds of information contained in set_name into the columns color_set and shape_set, and then drop set_name, so the previous df should look like:

   eff_date  prod_id   change_type  color_set       shape_set     
0  20150414  20770     ADD          MONO COLOR SET  REC SHAPE SET          
1  20150429  132       ADD          MONO COLOR SET  REC SHAPE SET
2  20150521  199       DROP         MONO COLOR SET  REC SHAPE SET
3  20150521  199       ADD          MONO COLOR SET  TET SHAPE SET

I first attempted splitting out the columns in a for loop and then aggregating with groupby:

for index, row in df.iterrows():
    if 'COLOR' in df.loc[index,'set_name']:
        df.loc[index,'color_set'] = df.loc[index,'set_name']
    if 'SHAPE' in df.loc[index,'set_name']:
        df.loc[index,'shape_set'] = df.loc[index,'set_name']
df = df.fillna('')
df.groupby(['eff_date','prod_id','change_type']).agg({'color_set':sum,'shape_set':sum})

However, this left me with a dataframe of only two columns and a multi-level index that I wasn't sure how to unstack.

                                color_set       shape_set
eff_date  prod_id  change_type 
20150414  20770    ADD          MONO COLOR SET  REC SHAPE SET
20150429  132      ADD          MONO COLOR SET  REC SHAPE SET
20150521  199      DROP         MONO COLOR SET  REC SHAPE SET
                   ADD          MONO COLOR SET  TET SHAPE SET

Any help on this is greatly appreciated!

Your code looks fine apart from having to reset your index, but we can simplify it quite a bit (in particular, removing the need for iterrows, which can be painfully slow) by using a pivot with a small trick to get your column names.
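To illustrate the first point, here is a minimal sketch of the groupby approach with the index reset at the end. The sample data is copied from the question. Two details are substitutions rather than the asker's original code: the iterrows loop is replaced with `Series.where`, and `''.join` stands in for `sum` as the string-concatenating aggregator. `sort=False` is added only to keep the groups in order of first appearance.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'eff_date': [20150414, 20150414, 20150429, 20150429,
                 20150521, 20150521, 20150521, 20150521],
    'prod_id': [20770, 20770, 132, 132, 199, 199, 199, 199],
    'set_name': ['MONO COLOR SET', 'REC SHAPE SET',
                 'MONO COLOR SET', 'REC SHAPE SET',
                 'MONO COLOR SET', 'REC SHAPE SET',
                 'TET SHAPE SET', 'MONO COLOR SET'],
    'change_type': ['ADD', 'ADD', 'ADD', 'ADD', 'DROP', 'DROP', 'ADD', 'ADD'],
})

# Keep set_name where it matches the keyword, and an empty string elsewhere
# (a vectorized substitute for the iterrows loop plus fillna('')).
df['color_set'] = df['set_name'].where(df['set_name'].str.contains('COLOR'), '')
df['shape_set'] = df['set_name'].where(df['set_name'].str.contains('SHAPE'), '')

# Concatenate the strings within each group, then turn the three
# index levels back into ordinary columns with reset_index().
result = (
    df.groupby(['eff_date', 'prod_id', 'change_type'], sort=False)
      .agg({'color_set': ''.join, 'shape_set': ''.join})
      .reset_index()
)

print(result)
```

After `reset_index()`, eff_date, prod_id and change_type are regular columns again, so no unstacking is needed.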

This answer assumes that you only have these two options in your column. If you have more categories, simply use numpy.select instead of numpy.where and define your conditions / outputs that way.


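import numpy as np

# Tag each row with the name of the column it should end up in.
df['key'] = np.where(df['set_name'].str.contains('COLOR'), 'color_set', 'shape_set')

# Spread the tags into columns; aggfunc='first' lets pivot_table handle string values.
df.pivot_table(
  index=['eff_date', 'prod_id', 'change_type'],
  columns='key',
  values='set_name',
  aggfunc='first'
).reset_index()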

key  eff_date  prod_id change_type       color_set      shape_set
0    20150414    20770         ADD  MONO COLOR SET  REC SHAPE SET
1    20150429      132         ADD  MONO COLOR SET  REC SHAPE SET
2    20150521      199         ADD  MONO COLOR SET  TET SHAPE SET
3    20150521      199        DROP  MONO COLOR SET  REC SHAPE SET
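For the case with more than two categories mentioned above, the numpy.select variant might look like the sketch below. The third category ('LRG SIZE SET', mapped to a size_set column) and the 'other_set' fallback are hypothetical and do not appear in the question; they are there only to show how the conditions and outputs line up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'LRG SIZE SET' is invented to illustrate a third category.
df = pd.DataFrame({
    'eff_date': [20150414, 20150414, 20150414],
    'prod_id': [20770, 20770, 20770],
    'set_name': ['MONO COLOR SET', 'REC SHAPE SET', 'LRG SIZE SET'],
    'change_type': ['ADD', 'ADD', 'ADD'],
})

# One boolean condition per category, in the same order as the output names.
conditions = [
    df['set_name'].str.contains('COLOR'),
    df['set_name'].str.contains('SHAPE'),
    df['set_name'].str.contains('SIZE'),
]
choices = ['color_set', 'shape_set', 'size_set']

# Rows matching none of the conditions fall into the (hypothetical) 'other_set' key.
df['key'] = np.select(conditions, choices, default='other_set')

wide = df.pivot_table(
    index=['eff_date', 'prod_id', 'change_type'],
    columns='key',
    values='set_name',
    aggfunc='first',
).reset_index()

print(wide)
```

The conditions are checked in order, so if a set_name could match several keywords, the first matching condition decides its key.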


 