Pandas Groupby: Count and mean combined

I'm working with pandas to summarise a data frame as a count of certain categories, together with the mean sentiment score for each of those categories.

I have a table of strings, each with a sentiment score, and I want to group by text source to get how many posts each source has and the average sentiment of those posts.

My (simplified) data frame looks like this:

source    text              sent
--------------------------------
bar       some string       0.13
foo       alt string        -0.8
bar       another str       0.7
foo       some text         -0.2
foo       more text         -0.5

The output should look something like this:

source    count     mean_sent
-----------------------------
foo       3         -0.5
bar       2         0.415

The answer is somewhere along the lines of:

df['sent'].groupby(df['source']).mean()

Yet this only gives each source and its mean, with no column headers.

You can use groupby with aggregate:

df = df.groupby('source') \
       .agg({'text':'size', 'sent':'mean'}) \
       .rename(columns={'text':'count','sent':'mean_sent'}) \
       .reset_index()
print (df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500

In newer versions of pandas (0.25 and later) you no longer need the rename; just use named aggregation:

df = df.groupby('source') \
       .agg(count=('text', 'size'), mean_sent=('sent', 'mean')) \
       .reset_index()

print (df)
  source  count  mean_sent
0    bar      2      0.415
1    foo      3     -0.500
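
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies named aggregation. It assumes pandas 0.25 or later; the variable name `summary` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    "text": ["some string", "alt string", "another str", "some text", "more text"],
    "sent": [0.13, -0.8, 0.7, -0.2, -0.5],
})

# Each keyword becomes an output column: name=(input column, function).
summary = (
    df.groupby("source")
      .agg(count=("text", "size"), mean_sent=("sent", "mean"))
      .reset_index()
)
print(summary)
```

Groups come out sorted by key, so bar is listed before foo; sort the result afterwards if you need a different order.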

The following should work fine:

df[['source','sent']].groupby('source').agg(['count','mean'])
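
Two things are worth knowing about this variant: passing a list of functions produces two-level column labels, and 'count' skips missing values whereas 'size' does not. The sketch below illustrates both. Note that I replaced one score with NaN to make the difference visible (that is my own change to the sample data), and the names `out` and `sizes` are illustrative.

```python
import numpy as np
import pandas as pd

# Question's data, with one score deliberately set to NaN (an assumption
# made here only to show how 'count' and 'size' differ).
df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo", "foo"],
    "sent": [0.13, -0.8, np.nan, -0.2, -0.5],
})

out = df[["source", "sent"]].groupby("source").agg(["count", "mean"])
# The columns are a MultiIndex: ('sent', 'count') and ('sent', 'mean').
# Flatten them to single-level names before resetting the index.
out.columns = ["count", "mean_sent"]
out = out.reset_index()

# 'size' counts rows per group, including rows where sent is NaN.
sizes = df.groupby("source").size()

print(out)
print(sizes)
```

If a post without a score should still count as a post, 'size' is probably the one you want.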

A shorter version to achieve this is:

df.groupby('source')['sent'].agg(count='size', mean_sent='mean').reset_index()

The nice thing about this is that you can extend it if you want to take the mean of multiple variables but only count once. Note that recent pandas versions reject a dictionary that renames the outputs, as well as selecting several columns with ['sent1', 'sent2'] directly after groupby, so pass named (column, function) pairs instead:

df.groupby('source').agg(count=('sent1', 'size'), mean_sent1=('sent1', 'mean'), mean_sent2=('sent2', 'mean')).reset_index()
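
With many score columns, another option is to compute the means for a list of columns in one go and attach the group size separately. This is a minimal sketch, not the answerer's code: the columns `sent1` and `sent2` follow the example above, but the sample values and the `mean_` prefix are made up for illustration.

```python
import pandas as pd

# Hypothetical data with two score columns.
df = pd.DataFrame({
    "source": ["bar", "foo", "bar", "foo"],
    "sent1": [0.1, -0.4, 0.3, -0.6],
    "sent2": [1.0, 2.0, 3.0, 4.0],
})

grouped = df.groupby("source")

# Mean of every listed column; note the double brackets for a column list.
means = grouped[["sent1", "sent2"]].mean().add_prefix("mean_")

# Count each group once and put it in front; the Series aligns on 'source'.
means.insert(0, "count", grouped.size())

result = means.reset_index()
print(result)
```

Adding another score column then only means extending the list passed to `grouped[[...]]`.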

For those looking to aggregate more than two columns (as I was): just add them to agg.

df = df.groupby(['id']).agg({'texts': 'size', 'char_num': 'mean', 'bytes': 'mean'}).reset_index()

I think this should give you the output you want:

result = df.groupby('source').size().to_frame('count')

result['mean_score'] = df.groupby('source').sent.mean()
