Adding a group-by aggregate as a new field in pandas

Question

If I run the following GROUP BY on a MySQL table:

SELECT col1, count(col2) * count(distinct(col3)) as agg_col
FROM my_table
GROUP BY col1

what I get is a table with two columns:

col1 agg_col

How can I do the same on a pandas dataframe?

Suppose I have a DataFrame with three columns: col1, col2 and col3. The group-by operation

grouped = my_df.groupby('col1')

returns the data grouped by col1.

Also,

agg_col_series = grouped.col2.size() * grouped.col3.nunique()

returns the aggregated column equivalent to the one in the SQL query. But how can I add this to the grouped dataframe?

Answer 1

Let's use groupby with apply and a lambda function that multiplies size by nunique, then rename the resulting series to 'agg_col' and call reset_index to get a dataframe.

import pandas as pd
import numpy as np

np.random.seed(443)
df = pd.DataFrame({'Col1':np.random.choice(['A','B','C'],50),
                   'Col2':np.random.randint(1000,9999,50),
                   'Col3':np.random.choice(['A','B','C','D','E','F','G','H','I','J'],50)})

df_out = df.groupby('Col1').apply(lambda x: x.Col2.size * x.Col3.nunique()).rename('agg_col').reset_index()

Output:

  Col1  agg_col
0    A      120
1    B       96
2    C      190
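One caveat: SQL COUNT(col2) skips NULLs, while size counts every row in the group, so the two only agree when Col2 has no missing values. Below is a minimal sketch of a variant without apply, assuming pandas 0.25 or later for named aggregation. The small frame and the intermediate names n_col2 and n_col3 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame (hypothetical data) with one missing Col2 value.
df = pd.DataFrame({
    'Col1': ['A', 'A', 'B', 'B', 'B'],
    'Col2': [10, np.nan, 30, 40, 50],
    'Col3': ['x', 'y', 'x', 'x', 'z'],
})

# Named aggregation: one output column per (source column, function) pair.
stats = df.groupby('Col1').agg(
    n_col2=('Col2', 'count'),    # non-null values only, like SQL COUNT(col2)
    n_col3=('Col3', 'nunique'),  # distinct non-null values, like COUNT(DISTINCT col3)
)

# Multiply the two per-group counts and turn the result into a dataframe.
df_out = (stats['n_col2'] * stats['n_col3']).rename('agg_col').reset_index()
print(df_out)
#   Col1  agg_col
# 0    A        2
# 1    B        6
```

With size in place of count, group A would come out as 4 here, because the row with the missing Col2 value would be counted.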

Answer 2

We'd need to see your data to be sure, but I think you simply need to reset the index of your agg_col_series:

agg_col_series.reset_index(name='agg_col')

Full example with dummy data:

import random
import pandas as pd

col1 = [random.randint(1,5) for x in range(1,1000)]
col2 = [random.randint(1,100) for x in range(1,1000)]
col3 = [random.randint(1,100) for x in range(1,1000)]

df = pd.DataFrame(data={
        'col1': col1,
        'col2': col2,
        'col3': col3,
    })

grouped = df.groupby('col1')
agg_col_series = grouped.col2.size() * grouped.col3.nunique()

print(agg_col_series.reset_index(name='agg_col'))

   col1  agg_col
0     1    15566
1     2    20056
2     3    17313
3     4    17304
4     5    16380
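If the goal is instead to attach the aggregate to every row of the original dataframe (one possible reading of the question), the series can be looked up by group key. This is a small sketch with hand-made data, not part of the original answer; it relies on agg_col_series being indexed by col1.

```python
import pandas as pd

# Small deterministic frame (hypothetical data) with the question's column names.
df = pd.DataFrame({
    'col1': [1, 1, 2, 2, 2],
    'col2': [10, 20, 30, 40, 50],
    'col3': [7, 8, 7, 7, 9],
})

grouped = df.groupby('col1')
agg_col_series = grouped.col2.size() * grouped.col3.nunique()

# The series is indexed by col1, so map looks up each row's group value.
df['agg_col'] = df['col1'].map(agg_col_series)
print(df)
```

Group 1 has 2 rows and 2 distinct col3 values, and group 2 has 3 rows and 2 distinct values, so the new column holds 4 for the first two rows and 6 for the rest.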

Answer 1
1 2017-07-01 15:02:52

Answer 2
1 Accepted 2017-07-01 15:07:52
