Aggregating rows in python pandas dataframe
I have a dataframe documenting when a product was added to and removed from a basket. However, the set_name column contains two sets of information: one for the color set and one for the shape set. See below:
eff_date prod_id set_name change_type
0 20150414 20770 MONO COLOR SET ADD
1 20150414 20770 REC SHAPE SET ADD
2 20150429 132 MONO COLOR SET ADD
3 20150429 132 REC SHAPE SET ADD
4 20150521 199 MONO COLOR SET DROP
5 20150521 199 REC SHAPE SET DROP
6 20150521 199 TET SHAPE SET ADD
7 20150521 199 MONO COLOR SET ADD
I would like to split out the two sets of information contained in set_name into columns color_set and shape_set, and drop set_name, so the previous df should look like:
eff_date prod_id change_type color_set shape_set
0 20150414 20770 ADD MONO COLOR SET REC SHAPE SET
1 20150429 132 ADD MONO COLOR SET REC SHAPE SET
2 20150521 199 DROP MONO COLOR SET REC SHAPE SET
3 20150521 199 ADD MONO COLOR SET TET SHAPE SET
I first attempted splitting out the columns in a for loop and then aggregating with groupby:
for index, row in df.iterrows():
    if 'COLOR' in df.loc[index,'set_name']:
        df.loc[index,'color_set'] = df.loc[index,'set_name']
    if 'SHAPE' in df.loc[index,'set_name']:
        df.loc[index,'shape_set'] = df.loc[index,'set_name']
df = df.fillna('')
df.groupby(['eff_date','prod_id','change_type']).agg({'color_set':sum,'shape_set':sum})
However, this left me with a dataframe of only two columns and a multi-level index that I wasn't sure how to unstack.
                              color_set       shape_set
eff_date prod_id change_type
20150414 20770   ADD          MONO COLOR SET  REC SHAPE SET
20150429 132     ADD          MONO COLOR SET  REC SHAPE SET
20150521 199     DROP         MONO COLOR SET  REC SHAPE SET
                 ADD          MONO COLOR SET  TET SHAPE SET
Any help on this is greatly appreciated!
Your code looks fine apart from needing to reset your index, but we can simplify it quite a bit (in particular, removing the need for iterrows, which can be painfully slow) by using a pivot with a small trick to get your column names.
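To make the first point concrete, here is a minimal sketch of the groupby approach with the index reset. It rebuilds the sample data from the question and, as assumptions that are not in the original code, replaces the iterrows loop with Series.where and uses "".join in place of sum to concatenate the strings in each group; sort=False is only there to keep the groups in their original order.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "eff_date": [20150414, 20150414, 20150429, 20150429,
                 20150521, 20150521, 20150521, 20150521],
    "prod_id": [20770, 20770, 132, 132, 199, 199, 199, 199],
    "set_name": ["MONO COLOR SET", "REC SHAPE SET",
                 "MONO COLOR SET", "REC SHAPE SET",
                 "MONO COLOR SET", "REC SHAPE SET",
                 "TET SHAPE SET", "MONO COLOR SET"],
    "change_type": ["ADD", "ADD", "ADD", "ADD",
                    "DROP", "DROP", "ADD", "ADD"],
})

# Vectorised stand-in for the loop: keep set_name where it matches, else ''.
is_color = df["set_name"].str.contains("COLOR")
df["color_set"] = df["set_name"].where(is_color, "")
df["shape_set"] = df["set_name"].where(~is_color, "")

# Aggregate per group, then turn the multi-level index back into columns.
result = (
    df.groupby(["eff_date", "prod_id", "change_type"], sort=False)
      .agg({"color_set": "".join, "shape_set": "".join})
      .reset_index()
)
print(result)
```

The reset_index() call is the only change needed to get flat columns back; passing as_index=False to groupby should have a similar effect.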
This answer assumes that you only have these two options in your column. If you have more categories, simply use numpy.select instead of numpy.where and define your conditions / outputs that way.
import numpy as np

# Label each row by the kind of set it describes; the labels become column names.
df['key'] = np.where(df['set_name'].str.contains('COLOR'), 'color_set', 'shape_set')

df.pivot_table(
    index=['eff_date', 'prod_id', 'change_type'],
    columns='key',
    values='set_name',
    aggfunc='first'
).reset_index()
key eff_date prod_id change_type color_set shape_set
0 20150414 20770 ADD MONO COLOR SET REC SHAPE SET
1 20150429 132 ADD MONO COLOR SET REC SHAPE SET
2 20150521 199 ADD MONO COLOR SET TET SHAPE SET
3 20150521 199 DROP MONO COLOR SET REC SHAPE SET
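For the case with more than two categories mentioned above, a rough sketch using numpy.select might look like the following. The third category ("SIZE"), the size_set label, and the other_set fallback are all hypothetical and invented purely for illustration; only the COLOR and SHAPE categories come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "LARGE SIZE SET" row is invented for illustration only.
df = pd.DataFrame({
    "eff_date": [20150414, 20150414, 20150414],
    "prod_id": [20770, 20770, 20770],
    "set_name": ["MONO COLOR SET", "REC SHAPE SET", "LARGE SIZE SET"],
    "change_type": ["ADD", "ADD", "ADD"],
})

# One boolean condition per category, paired positionally with its output label.
conditions = [
    df["set_name"].str.contains("COLOR"),
    df["set_name"].str.contains("SHAPE"),
    df["set_name"].str.contains("SIZE"),
]
choices = ["color_set", "shape_set", "size_set"]

# Rows matching none of the conditions fall into the (hypothetical) default label.
df["key"] = np.select(conditions, choices, default="other_set")

wide = df.pivot_table(
    index=["eff_date", "prod_id", "change_type"],
    columns="key",
    values="set_name",
    aggfunc="first",
).reset_index()
wide.columns.name = None  # drop the leftover 'key' label on the column axis
print(wide)
```

The rest of the pivot is unchanged from the two-category version; only the way the key column is built differs.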