Pandas groupby and sum while retaining other attributes
I have seen examples of Pandas' aggregate function, but they don't solve my problem. Those examples either sum all the attributes or sum only a few of them, and the resulting df contains only the summed attributes and the attributes used in groupby. In my case, there are certain attributes that I don't want to group by or sum, but that I still want to keep in the resulting df.
I am trying to group by and sum some attributes while preserving other attributes that are not summed, but I am running into the problems described below.
In my transaction dataset, customer_ID is unique for each customer and entry time is unique for each transaction. Any customer will have multiple transactions over a period of time. Most transactions appear in two or more rows, depending on how many tags are associated with the transaction (usually 2 to 4 tags). I need to combine the multiple rows of each transaction into a single row with one customer_ID, gender, age, entry time, location, country and all the Tag attributes.
If I group by only customer_ID and entry time and sum the Tags, the resulting dataframe has the correct number of unique customers: 150K. But I lose the attributes gender, age, location, country, exit time, value 1 and value 2 in the resulting df.
result = df.groupby(["customer_ID", "entry time"])[["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]].sum().reset_index()
If I group by all the needed attributes and sum the Tags, I only get 90K unique customers, which is not correct.
result = df.groupby(["customer_ID", "entry time", "gender", "age", "location", "country", "exit time", "value 1", "value 2"
])[["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]].sum().reset_index()
So how do I efficiently group by only customer_ID and entry time, sum all the Tag columns, and still retain the other attributes in the resulting df (the df is around 700 MB)?
OK, if I understand the question correctly, then I think this may work:
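import pandas as pd

tag_cols = ["Tag1", "Tag2", "Tag3", "Tag4", "Tag5", "Tag6", "Tag7", "Tag8"]
join_cols = ["customer_ID", "entry time"]
# Sum the tag columns once per (customer_ID, entry time) pair.
df1 = df.groupby(join_cols)[tag_cols].sum().reset_index()
# Keep one row of the non-tag attributes per pair, then join on the keys only.
others = df.drop(columns=tag_cols).drop_duplicates(subset=join_cols)
df2 = pd.merge(df1, others, on=join_cols, how="left")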
Then df2 should have what you need.
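As a rough check of the sum-then-merge idea, here is a minimal sketch on a made-up toy frame. The data, the two-tag layout and the names sums, others and result are assumptions for illustration only. The non-tag attributes are de-duplicated on the key columns before the merge, and validate="one_to_one" makes pandas raise if a key pair still maps to more than one row.

```python
import pandas as pd

# Hypothetical toy data: two transactions, each repeated once per tag.
df = pd.DataFrame({
    "customer_ID": [1, 1, 2, 2, 2],
    "entry time": ["09:00", "09:00", "10:30", "10:30", "10:30"],
    "gender": ["F", "F", "M", "M", "M"],
    "country": ["US", "US", None, None, None],  # missing values are kept
    "Tag1": [1, 0, 0, 1, 0],
    "Tag2": [0, 1, 0, 0, 1],
})

tag_cols = ["Tag1", "Tag2"]
join_cols = ["customer_ID", "entry time"]

# Step 1: sum the tag columns per transaction.
sums = df.groupby(join_cols)[tag_cols].sum().reset_index()

# Step 2: keep the first row of the remaining attributes per transaction.
others = df.drop(columns=tag_cols).drop_duplicates(subset=join_cols)

# Step 3: join on the key columns only; fail loudly if keys are not unique.
result = sums.merge(others, on=join_cols, how="left", validate="one_to_one")
print(result)
```

Because the attributes are never used as group keys here, a missing country does not remove the transaction from the result.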
Technically, you are aggregating on unique pairs of customer_ID and entry time (not on unique customers). To keep the other attributes, you have to decide which value to retain for each of them. Consider extending the groupby().aggregate call to retrieve the first, last, min or max value.
agg_df = (df.groupby(['customer_ID', 'entry time'], as_index=False)
            .aggregate({'gender': 'first', 'age': 'first',
                        'location': 'first', 'country': 'first',
                        'exit time': 'first', 'value 1': 'first', 'value 2': 'first',
                        'Tag1': 'sum', 'Tag2': 'sum', 'Tag3': 'sum', 'Tag4': 'sum',
                        'Tag5': 'sum', 'Tag6': 'sum', 'Tag7': 'sum', 'Tag8': 'sum'})
         )
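With many columns, the aggregation dictionary can be built instead of typed out. The following is a minimal sketch on a made-up toy frame. It assumes that every tag column name starts with "Tag" and that 'first' is an acceptable choice for all the remaining attributes. The names key_cols, keep_cols and agg_spec are illustrative.

```python
import pandas as pd

# Hypothetical toy data: two transactions, each stored as two rows.
df = pd.DataFrame({
    "customer_ID": [1, 1, 2, 2],
    "entry time": ["09:00", "09:00", "10:30", "10:30"],
    "gender": ["F", "F", "M", "M"],
    "age": [34, 34, 51, 51],
    "Tag1": [1, 0, 0, 1],
    "Tag2": [0, 1, 1, 1],
})

key_cols = ["customer_ID", "entry time"]
# Assumption: tag columns are exactly those whose name starts with "Tag".
tag_cols = [c for c in df.columns if c.startswith("Tag")]
# Everything else is carried along rather than summed.
keep_cols = [c for c in df.columns if c not in key_cols + tag_cols]

# 'first' for carried attributes, 'sum' for the tags.
agg_spec = {**{c: "first" for c in keep_cols}, **{c: "sum" for c in tag_cols}}
agg_df = df.groupby(key_cols, as_index=False).agg(agg_spec)
print(agg_df)
```

Note that the groupby 'first' aggregation returns the first non-null value in each group, so an attribute that is missing in some rows of a transaction is filled from another row of the same transaction.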