Pandas, groupby and finding maximum in groups, returning value and count

I have a pandas DataFrame with log data:

        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web

And I want to find the service on every host that gives the most errors:

        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2

The only solution I found was grouping by host and service, and then iterating over level 0 of the index.

Can anyone suggest a better, shorter version, without the iteration?

import numpy as np
import pandas as pd

# Count log rows per (host, service) pair
df = df_logfile.groupby(['host', 'service']).size().to_frame('no')

# One result row per host, to be filled in by the loop below
df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service'] = None
df_count['no'] = np.nan

for h, data in df.groupby(level=0):
    i = data['no'].idxmax()      # (host, service) label of the largest count
    service = i[1]
    no = data.loc[i, 'no']
    df_count.loc[df_count['host'] == h, 'service'] = service
    df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no'] = no

Full code: https://gist.github.com/bjelline/d8066de66e305887b714

Given df, the next step is to group by the host value alone and aggregate by idxmax. This gives you the index label that corresponds to the greatest service count within each host. You can then use df.loc[...] to select the rows in df that correspond to those greatest counts:

import numpy as np
import pandas as pd

df_logfile = pd.DataFrame({ 
    'host' : ['this.com', 'this.com', 'this.com', 'that.com', 'other.net', 
              'other.net', 'other.net'],
    'service' : ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web' ] })

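# Count rows per (host, service); the dict-renaming form agg({'no': 'count'})
# is no longer supported on a grouped column in current pandas
df = df_logfile.groupby(['host', 'service']).size().to_frame('no')
# For each host, idxmax gives the (host, service) label with the largest count
mask = df.groupby(level=0).agg('idxmax')
df_count = df.loc[mask['no'].tolist()]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))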

yields the DataFrame

        host service  no
0  other.net     web   2
1   that.com    mail   1
2   this.com    mail   2
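
The result above comes back sorted by host, whereas the desired output in the question lists the hosts in order of first appearance. Below is a minimal sketch of one way to keep that order, assuming the same host and service column names as df_logfile. The helper name top_service_per_host is hypothetical and not part of pandas; the approach (size, then idxmax per host) is the same idea as above, only with sort=False.

```python
import pandas as pd

def top_service_per_host(log: pd.DataFrame) -> pd.DataFrame:
    """Return, for each host, the service with the most log rows and its count."""
    # sort=False keeps hosts and services in order of first appearance
    counts = (log.groupby(['host', 'service'], sort=False)
                 .size()
                 .reset_index(name='no'))
    # idxmax gives the row label of the largest count within each host;
    # on a tie it returns the first one encountered
    best = counts.groupby('host', sort=False)['no'].idxmax()
    return counts.loc[best].reset_index(drop=True)

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com',
             'other.net', 'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web'],
})

df_count = top_service_per_host(df_logfile)
print(df_count)
```

With the sample data this should give the three rows from the question in the order this.com, that.com, other.net. If two services tie on a host, only the first one is kept, which may or may not be what you want.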
