Pandas: groupby and find the maximum in each group, returning value and count

Question

I have a pandas DataFrame with log data:

        host service
0   this.com    mail
1   this.com    mail
2   this.com     web
3   that.com    mail
4  other.net    mail
5  other.net     web
6  other.net     web

For every host, I want to find the service that produces the most errors:

        host service  no
0   this.com    mail   2
1   that.com    mail   1
2  other.net     web   2

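The only solution I have found is to group by host and service and then iterate over level 0 of the index.

Can anyone suggest a better, shorter version, without the iteration?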

df = df_logfile.groupby(['host','service']).agg({'service':np.size})

df_count = pd.DataFrame()
df_count['host'] = df_logfile['host'].unique()
df_count['service']  = np.nan
df_count['no']    = np.nan

for h,data in df.groupby(level=0):
  i = data.idxmax()[0]   
  service = i[1]             
  no = data.xs(i)[0]
  df_count.loc[df_count['host'] == h, 'service'] = service
  df_count.loc[(df_count['host'] == h) & (df_count['service'] == service), 'no']   = no

Full code: https://gist.github.com/bjelline/d8066de66e305887b714

Answer 1

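Given df (the counts per host and service), the next step is to group by the host level alone and aggregate with idxmax. This gives you, for each host, the index label of the row with the largest count. You can then use df.loc[...] to select the rows of df that correspond to those largest counts: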

import numpy as np
import pandas as pd

df_logfile = pd.DataFrame({ 
    'host' : ['this.com', 'this.com', 'this.com', 'that.com', 'other.net', 
              'other.net', 'other.net'],
    'service' : ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web' ] })

# Count rows per (host, service) pair; pandas 1.0+ rejects the renaming dict .agg({'no':'count'})
df = df_logfile.groupby(['host','service']).size().to_frame('no')
mask = df.groupby(level=0).agg('idxmax')
df_count = df.loc[mask['no']]
df_count = df_count.reset_index()
print("\nOutput\n{}".format(df_count))

yields the DataFrame

        host service  no
0  other.net     web   2
1   that.com    mail   1
2   this.com    mail   2
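
Note that the result above is sorted by host, whereas the desired output in the question lists hosts in the order they first appear in the log. The following is a minimal sketch of one way to keep that order, assuming a reasonably recent pandas; the names counts and top_rows are made up for illustration and do not appear in the original answer. It passes sort=False to both groupby calls and works on a flat frame instead of a MultiIndex:

```python
import pandas as pd

df_logfile = pd.DataFrame({
    'host': ['this.com', 'this.com', 'this.com', 'that.com',
             'other.net', 'other.net', 'other.net'],
    'service': ['mail', 'mail', 'web', 'mail', 'mail', 'web', 'web'],
})

# Count rows per (host, service) pair. sort=False keeps the groups in
# order of first appearance instead of sorting the keys.
counts = (df_logfile.groupby(['host', 'service'], sort=False)
          .size()
          .reset_index(name='no'))

# For each host, idxmax gives the row label of its largest count
# (the first such row if several services tie).
top_rows = counts.groupby('host', sort=False)['no'].idxmax()

# Select those rows and renumber them from 0.
df_count = counts.loc[top_rows.tolist()].reset_index(drop=True)
print(df_count)
```

If two services on a host have the same count, only the first one encountered is returned; whether that is acceptable depends on what the report is used for.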

Answer 1: score 4, accepted, 2014-11-02 17:19:01
