Group by with aggregation function as new field in pandas
If I run the following GROUP BY query on a MySQL table:
SELECT col1, count(col2) * count(distinct(col3)) as agg_col
FROM my_table
GROUP BY col1
what I get is a table with two columns:
col1 agg_col
How can I do the same on a pandas dataframe?
Suppose I have a DataFrame with three columns: col1, col2 and col3. The group-by operation
grouped = my_df.groupby('col1')
will return the data grouped by col1.
Also,
agg_col_series = grouped.col2.size() * grouped.col3.nunique()
will return the aggregated column equivalent to the one in the SQL query. But how can I add this to the grouped dataframe?
Let's use groupby with a lambda function that uses size and nunique, then rename the series to 'agg_col' and reset_index to get a dataframe.
import pandas as pd
import numpy as np
np.random.seed(443)
df = pd.DataFrame({'Col1':np.random.choice(['A','B','C'],50),
'Col2':np.random.randint(1000,9999,50),
'Col3':np.random.choice(['A','B','C','D','E','F','G','H','I','J'],50)})
df_out = df.groupby('Col1').apply(lambda x: x.Col2.size * x.Col3.nunique()).rename('agg_col').reset_index()
Output:
Col1 agg_col
0 A 120
1 B 96
2 C 190
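As a possible alternative to the lambda, the same per-group product can be built from pandas' built-in aggregations, which avoids calling a Python function once per group. This is only a sketch: it assumes pandas 0.25 or later (for named aggregation), and the intermediate column names n_rows and n_unique are made up for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(443)
df = pd.DataFrame({
    'Col1': np.random.choice(['A', 'B', 'C'], 50),
    'Col2': np.random.randint(1000, 9999, 50),
    'Col3': np.random.choice(list('ABCDEFGHIJ'), 50),
})

# Compute both per-group statistics with built-in aggregations.
# n_rows and n_unique are hypothetical names chosen for this sketch.
stats = df.groupby('Col1').agg(
    n_rows=('Col2', 'size'),
    n_unique=('Col3', 'nunique'),
)

# Multiply the two statistics, name the result, and turn the
# group key back into an ordinary column.
df_out = (stats['n_rows'] * stats['n_unique']).rename('agg_col').reset_index()
print(df_out)
```

The result has the same two columns, Col1 and agg_col, as the apply-based version.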
We'd need to see your data to be sure, but I think you simply need to reset the index of your agg_col_series:
agg_col_series.reset_index(name='agg_col')
Full example with dummy data:
import random
import pandas as pd
col1 = [random.randint(1,5) for x in range(1,1000)]
col2 = [random.randint(1,100) for x in range(1,1000)]
col3 = [random.randint(1,100) for x in range(1,1000)]
df = pd.DataFrame(data={
'col1': col1,
'col2': col2,
'col3': col3,
})
grouped = df.groupby('col1')
agg_col_series = grouped.col2.size() * grouped.col3.nunique()
print(agg_col_series.reset_index(name='agg_col'))
index col1 agg_col
0 1 15566
1 2 20056
2 3 17313
3 4 17304
4 5 16380
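One caveat worth checking against your own data: SQL's count(col2) skips NULLs, whereas size() counts every row in the group. If col2 can contain missing values, count() may be the closer match (nunique() already ignores NaN by default, like count(distinct ...)). A small sketch with made-up data, where one col2 value is deliberately missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 1 has one missing col2 value.
df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2],
    'col2': [10.0, np.nan, 30.0, 40.0, 50.0],
    'col3': ['x', 'x', 'y', 'z', 'z'],
})
grouped = df.groupby('col1')

# size() counts all rows in each group, including the NaN row.
with_size = (grouped.col2.size() * grouped.col3.nunique()).reset_index(name='agg_col')

# count() counts only non-missing values, like SQL count(col2).
with_count = (grouped.col2.count() * grouped.col3.nunique()).reset_index(name='agg_col')

print(with_size)
print(with_count)
```

Here group 1 yields 6 with size() (3 rows times 2 distinct values) but 4 with count() (2 non-missing values times 2 distinct values); group 2 yields 2 either way.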