Pandas groupby and sum while retaining other attributes

Question

I have seen examples of Pandas' aggregate function, but those don't solve my problem. Because the examples of aggregate function either sum all the attributes or sum only few attributes and resulting df only has these summed attributes or the attributes used in groupby . In my case, I don't want to use certain attributes for either group by or sum and still keep them in the resulting df .

I am trying to group and sum some attributes, while preserving other attributes which are not summed but facing challenges as described below.

In my transaction dataset, Customer_ID are unique for each customer and entry time is unique for each transaction. Any customer will have multiple transactions during a period of time. Most transactions are repeated twice or more often depending on how many tags are associated with a transaction (but usually 2 to 4 tags). I need to combine such multiple entries of each transaction to only 1 row, with 1 customer_ID , one gender , age , entry time , location , country and all the Tag attributes.

If I group by only customer_ID , entry time and sum the Tags , the resulting dataframe has the correct number of unique customers: 150K. But I lose the attributes gender , age , location , country , exit time , value 1 , value 2 in the resulting df .

result = df.groupby(["customer_ID","entry time"])["Tag1", "Tag2","Tag3","Tag4","Tag5","Tag6","Tag7","Tag8"].sum().reset_index()

If I group by all the needed attributes and sum the Tags , I only get 90K unique customers, which is not correct.

result = df.groupby(["customer_ID", "entry time", "gender", "age","location", "country", "exit time", "value 1", "value 2"
 ])["Tag1","Tag2","Tag3","Tag4","Tag5","Tag6","Tag7","Tag8"].sum().reset_index()

So how do I efficiently group by only customer_ID and entry time , sum all the Tag columns and still retain other attributes in the resulting df (df size is around 700 MB)?

Answer 1

Ok, if I understand the question correctly, then I think this may work:

tag_cols = ["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]
join_cols = ["customer_ID", "entry time"]

df1 = df.groupby(join_cols)[tag_cols].sum().reset_index()
df2 = pd.merge(df1, df, on=tag_cols.append(join_cols), how="left")

Then df2 should have what you need.

Answer 2

Technically, you are attempting to aggregate on unique customer_ID and entry time (not unique customers). In order to maintain the other attributes, some aggregate decision has to be made for which values to retain. Consider extending a groupby().aggregate call to retrieve the first , last , min or max value.

agg_df = (df.groupby(['customer_ID', 'entry time'], as_index=False)
            .aggregate({'gender':'first', 'age':'first', 
                        'location':'first', 'country':'first', 
                        'exit time':'first', 'value 1':'first', 'value 2':'first',
                        'Tag1':'sum', 'Tag2':'sum', 'Tag3':'sum', 'Tag4':'sum', 
                        'Tag5':'sum', 'Tag6':'sum', 'Tag7':'sum', 'Tag8':'sum'})
         )

Pandas groupby and sum while retaining other attributes

Question

2 answers

solution1
1 2019-08-21 13:48:58

solution2
1 ACCPTED 2019-08-21 14:12:30

Pandas groupby and sum while retaining other attributes

Question

2 answers

solution1 1 2019-08-21 13:48:58

solution2 1 ACCPTED 2019-08-21 14:12:30

solution1
1 2019-08-21 13:48:58

solution2
1 ACCPTED 2019-08-21 14:12:30