
Pandas groupby and sum while retaining other attributes

I have seen examples of Pandas' aggregate function, but they don't solve my problem. Those examples either sum all the attributes, or sum only a few of them and return a df that contains just the summed attributes and the ones used in groupby. In my case, some attributes should be neither grouped on nor summed, yet still be kept in the resulting df.

I am trying to group and sum some attributes while preserving other attributes that are not summed, but I am running into the problems described below.

Data snippet

In my transaction dataset, customer_ID is unique for each customer and entry time is unique for each transaction. Any customer can have multiple transactions over a period of time. Most transactions appear two or more times, depending on how many tags are associated with the transaction (usually 2 to 4 tags). I need to combine the multiple rows of each transaction into a single row with one customer_ID, gender, age, entry time, location, country and all the Tag attributes.

If I group by only customer_ID and entry time and sum the Tags, the resulting dataframe has the correct number of unique customers: 150K. But I lose the attributes gender, age, location, country, exit time, value 1 and value 2 in the resulting df.

result = df.groupby(["customer_ID", "entry time"])[["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]].sum().reset_index()

If I group by all the needed attributes and sum the Tags, I get only 90K unique customers, which is not correct.

result = df.groupby(["customer_ID", "entry time", "gender", "age", "location", "country", "exit time", "value 1", "value 2"
 ])[["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]].sum().reset_index()
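One possible cause of the lower count, offered only as an assumption since the real data is not shown: groupby leaves out any row whose key contains a missing value, so a customer with NaN in one of the extra grouping columns disappears from the result. The toy frame below is hypothetical; its column names follow the question and its values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: customer 2 has no recorded gender.
df = pd.DataFrame({
    "customer_ID": [1, 1, 2, 2],
    "entry time": ["09:00", "09:00", "10:30", "10:30"],
    "gender": ["F", "F", np.nan, np.nan],
    "Tag1": [1, 0, 0, 1],
    "Tag2": [0, 1, 1, 0],
})

keys = ["customer_ID", "entry time", "gender"]

# Default behaviour: rows with a missing key value are left out.
dropped = df.groupby(keys)[["Tag1", "Tag2"]].sum().reset_index()

# dropna=False (pandas 1.1 or later) keeps NaN as its own group.
kept = df.groupby(keys, dropna=False)[["Tag1", "Tag2"]].sum().reset_index()

print(dropped["customer_ID"].nunique())  # 1
print(kept["customer_ID"].nunique())     # 2
```

Checking df.isna().sum() on the grouping columns would show whether this applies to the actual dataset.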

Example of rows for 1 transaction
Example of what I want for 1 transaction

So how do I efficiently group by only customer_ID and entry time, sum all the Tag columns and still retain the other attributes in the resulting df (the df is around 700 MB)?

OK, if I understand the question correctly, I think this may work:

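import pandas as pd

tag_cols = ["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]
join_cols = ["customer_ID", "entry time"]

# Sum the tags per transaction.
df1 = df.groupby(join_cols)[tag_cols].sum().reset_index()
# Keep the other attributes from the first row of each transaction, then join on the keys only.
other = df.drop(columns=tag_cols).drop_duplicates(subset=join_cols)
df2 = pd.merge(df1, other, on=join_cols, how="left")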

Then df2 should have what you need.
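A minimal sketch of the sum-then-merge idea on hypothetical toy data, merging on the key columns only and taking the other attributes from the first row of each transaction. The values are invented, and validate="one_to_one" is an optional safeguard added here, not part of the answer.

```python
import pandas as pd

# Hypothetical toy data: transaction (1, "09:00") is split over two rows.
df = pd.DataFrame({
    "customer_ID": [1, 1, 2],
    "entry time": ["09:00", "09:00", "10:30"],
    "gender": ["F", "F", "M"],
    "country": ["DE", "DE", "FR"],
    "Tag1": [1, 0, 0],
    "Tag2": [0, 1, 1],
})

tag_cols = ["Tag1", "Tag2"]
join_cols = ["customer_ID", "entry time"]

# One row of summed tags per transaction.
sums = df.groupby(join_cols)[tag_cols].sum().reset_index()

# One row of non-tag attributes per transaction (first occurrence wins).
attrs = df.drop(columns=tag_cols).drop_duplicates(subset=join_cols)

# validate raises MergeError if either side has duplicate keys.
result = pd.merge(sums, attrs, on=join_cols, how="left", validate="one_to_one")
print(result)
```

If the non-tag attributes can differ between the rows of one transaction, drop_duplicates silently keeps the first variant, so it may be worth checking that assumption before relying on the result.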

Technically, you are aggregating over unique combinations of customer_ID and entry time (that is, unique transactions, not unique customers). To keep the other attributes, you have to decide which value to retain for each of them within a group. Consider extending the groupby().aggregate call to take the first, last, min or max value of those columns.

agg_df = (df.groupby(['customer_ID', 'entry time'], as_index=False)
            .aggregate({'gender':'first', 'age':'first', 
                        'location':'first', 'country':'first', 
                        'exit time':'first', 'value 1':'first', 'value 2':'first',
                        'Tag1':'sum', 'Tag2':'sum', 'Tag3':'sum', 'Tag4':'sum', 
                        'Tag5':'sum', 'Tag6':'sum', 'Tag7':'sum', 'Tag8':'sum'})
         )
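The same mapping can be built from the column names rather than typed out. This is a sketch on hypothetical toy data, assuming every column to be summed starts with "Tag" and every remaining non-key column should take its first value.

```python
import pandas as pd

# Hypothetical toy data: two transactions, two rows each.
df = pd.DataFrame({
    "customer_ID": [1, 1, 2, 2],
    "entry time": ["09:00", "09:00", "10:30", "10:30"],
    "gender": ["F", "F", "M", "M"],
    "exit time": ["09:20", "09:20", "10:45", "10:45"],
    "Tag1": [1, 0, 0, 0],
    "Tag2": [0, 1, 1, 0],
    "Tag3": [0, 0, 0, 1],
})

keys = ["customer_ID", "entry time"]
tag_cols = [c for c in df.columns if c.startswith("Tag")]
other_cols = [c for c in df.columns if c not in keys + tag_cols]

# "first" for retained attributes, "sum" for the tag columns.
agg_map = {c: "first" for c in other_cols}
agg_map.update({c: "sum" for c in tag_cols})

agg_df = df.groupby(keys, as_index=False).agg(agg_map)
print(agg_df)
```

Note that "first" returns the first non-null value in each group, so a missing value in one row is filled from a later row of the same transaction.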
