
Pandas groupby and sum while retaining other attributes

I have seen examples of Pandas' aggregate function, but they don't solve my problem: they either sum all the attributes, or sum only a few and return a df that contains just those summed attributes plus the ones used in groupby. In my case, there are certain attributes that I don't want to group by or sum, but that I still want to keep in the resulting df.

I am trying to group by and sum some attributes while preserving others that are not summed, but I am running into the problems described below.

[Image: data snippet]

In my transaction dataset, customer_ID is unique for each customer and entry time is unique for each transaction. Each customer can have multiple transactions over a period of time. Most transactions appear two or more times, depending on how many tags are associated with the transaction (usually 2 to 4). I need to combine the multiple entries of each transaction into a single row, with one customer_ID, gender, age, entry time, location, country, and all the Tag attributes.

If I group by only customer_ID and entry time and sum the Tags, the resulting dataframe has the correct number of unique customers (150K), but I lose the attributes gender, age, location, country, exit time, value 1 and value 2.

result = df.groupby(["customer_ID", "entry time"])[["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]].sum().reset_index()

If I group by all the needed attributes and sum the Tags , I only get 90K unique customers, which is not correct.

result = df.groupby(["customer_ID", "entry time", "gender", "age", "location", "country", "exit time", "value 1", "value 2"
 ])[["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]].sum().reset_index()
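One possible cause of the lower count, offered as an assumption because the data itself is not shown: by default, groupby drops every row whose key columns contain a missing value, so a customer with a NaN in, say, location disappears once location becomes a grouping key. The sketch below uses made-up toy values and the dropna parameter (available in pandas 1.1 and later) to show the effect.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): customer 2 has no location recorded.
df = pd.DataFrame({
    "customer_ID": [1, 1, 2],
    "entry time": ["09:00", "09:00", "10:30"],
    "location": ["Berlin", "Berlin", np.nan],
    "Tag1": [1, 0, 1],
})

keys = ["customer_ID", "entry time", "location"]

# Default behaviour: rows with NaN in any key column are dropped.
default = df.groupby(keys)[["Tag1"]].sum().reset_index()

# dropna=False keeps NaN as its own group value.
kept = df.groupby(keys, dropna=False)[["Tag1"]].sum().reset_index()

print(default["customer_ID"].nunique())  # customer 2 is missing here
print(kept["customer_ID"].nunique())     # both customers are present

# Count missing values per key column to check whether this applies at all.
print(df[keys].isna().sum())
```

If the per-column NaN counts on the real data are all zero, this explanation does not apply and the cause lies elsewhere.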

[Image: example rows for 1 transaction] [Image: example of the result I want for 1 transaction]

So how do I efficiently group by only customer_ID and entry time, sum all the Tag columns, and still retain the other attributes in the resulting df (the df is around 700 MB)?

OK, if I understand the question correctly, I think this may work:

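import pandas as pd

tag_cols = ["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]
join_cols = ["customer_ID", "entry time"]

# Sum the tags per transaction.
df1 = df.groupby(join_cols)[tag_cols].sum().reset_index()
# Keep one row of the other attributes per transaction, then join on the keys only.
attrs = df.drop(columns=tag_cols).drop_duplicates(subset=join_cols)
df2 = pd.merge(df1, attrs, on=join_cols, how="left")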

Then df2 should have what you need.
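To check the merge idea on something small, here is a minimal sketch on toy data. The values and the reduced column set are made up for illustration, and it assumes the non-tag attributes are identical across the repeated rows of a transaction. It joins on the two key columns only and deduplicates the attributes first, so each transaction ends up as exactly one row.

```python
import pandas as pd

# Toy data (hypothetical values): customer 1's transaction is recorded twice,
# once per tag.
df = pd.DataFrame({
    "customer_ID": [1, 1, 2],
    "entry time": ["09:00", "09:00", "10:30"],
    "gender": ["F", "F", "M"],
    "age": [34, 34, 51],
    "Tag1": [1, 0, 1],
    "Tag2": [0, 1, 0],
})

tag_cols = ["Tag1", "Tag2"]
join_cols = ["customer_ID", "entry time"]

# Sum the tags per transaction.
tag_sums = df.groupby(join_cols)[tag_cols].sum().reset_index()

# One row of non-tag attributes per transaction (first occurrence wins).
attrs = df.drop(columns=tag_cols).drop_duplicates(subset=join_cols)

# validate raises if either side has duplicate keys, which guards against
# accidentally multiplying rows.
merged = tag_sums.merge(attrs, on=join_cols, how="left", validate="one_to_one")
print(merged)
```

If the attributes can differ between the repeated rows, drop_duplicates silently keeps the first one, so it is worth deciding explicitly which value should win.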

Technically, you are aggregating on unique combinations of customer_ID and entry time (not on unique customers). To keep the other attributes, you have to decide which value to retain for each of them. Consider extending the groupby().aggregate call to take the first, last, min or max value.

agg_df = (df.groupby(['customer_ID', 'entry time'], as_index=False)
            .aggregate({'gender':'first', 'age':'first', 
                        'location':'first', 'country':'first', 
                        'exit time':'first', 'value 1':'first', 'value 2':'first',
                        'Tag1':'sum', 'Tag2':'sum', 'Tag3':'sum', 'Tag4':'sum', 
                        'Tag5':'sum', 'Tag6':'sum', 'Tag7':'sum', 'Tag8':'sum'})
         )
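With many columns, the aggregation dictionary can also be built programmatically instead of being typed out. The following is a sketch on toy data: the values, the reduced column set, and the helper names key_cols, keep_cols and agg_spec are made up for illustration, and it assumes every tag column name starts with "Tag".

```python
import pandas as pd

# Toy data (hypothetical values): customer 1's transaction appears twice.
df = pd.DataFrame({
    "customer_ID": [1, 1, 2],
    "entry time": ["09:00", "09:00", "10:30"],
    "gender": ["F", "F", "M"],
    "country": ["DE", "DE", "FR"],
    "Tag1": [1, 0, 1],
    "Tag2": [0, 1, 0],
})

key_cols = ["customer_ID", "entry time"]
tag_cols = [c for c in df.columns if c.startswith("Tag")]
keep_cols = [c for c in df.columns if c not in key_cols + tag_cols]

# 'first' for attributes that are carried along, 'sum' for the tag columns.
agg_spec = {**{c: "first" for c in keep_cols}, **{c: "sum" for c in tag_cols}}

agg_df = df.groupby(key_cols, as_index=False).agg(agg_spec)
print(agg_df)
```

Note that the groupby 'first' aggregation skips missing values by default, so it returns the first non-null entry in each group rather than strictly the first row's value.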
