How do I group a pandas time series by one or more dimensions?

Question

I have data like this:

timestamp, country_code,  request_type,   latency
2013-10-10-13:40:01,  1,    get_account,    134
2013-10-10-13:40:63,  34,   get_account,    256
2013-10-10-13:41:09,  230,  modify_account, 589
2013-10-10-13:41:12,  230,  get_account,    43
2013-10-10-13:53:12,  1,    modify_account, 1003

The timestamps have one-second resolution and are irregularly spaced.

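How can I express queries like these in pandas:

1. How many requests are there per country_code at 10-minute resolution?
2. What is the 99th percentile of latency by request_type at 1-minute resolution?
3. How many requests are there per country_code and request_type at 10-minute resolution?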

I then want to plot all groups on the same graph, with each group as its own line over time.

Update:

Based on the suggestion for query 1, I have:

bycc = df.groupby('country_code').reason.resample('10T', how='count')
bycc.plot() # BAD: uses (country_code, timestamp) on the x axis
bycc[1].plot() # properly graphs the time-series for country_code=1

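But I can't seem to find a simple way to plot each country_code as a separate line, with proper timestamps on the x axis and the values on y. I think there are two problems: (1) the timestamps differ for each country_code and need to be aligned to the same start and end, and (2) I need to find the right API or method to turn the multi-indexed TimeSeries object into a single plot with one line per value of the first index level. Working my way through it...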

Update 2

The following seems to do it:

i = 0
max = 3
pylab.rcParams['figure.figsize'] = (20.0, 10.0) # get bigger graph
for cc in bycc.index.levels[0]:
    i = i + 1
    if (i <= max):
        cclabel = "cc=%d" % (cc)
        bycc[cc].plot(legend=True, label=cclabel)

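Only the first max groups are plotted, because the graph gets noisy otherwise. Now on to figuring out how to better display a plot with many time series.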
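One possible way to address both problems described above (aligning timestamps and getting one line per country code) is to pivot the grouped counts into a wide table. This is only a sketch under assumptions: it uses `pd.Grouper` rather than the `resample(..., how=...)` call from the question, rebuilds the sample data inline with the invalid 63-second value changed to 59, and the names `counts` and `wide` are illustrative.

```python
import pandas as pd

# Sample data from the question (the 13:40:63 value is assumed to be 13:40:59).
df = pd.DataFrame(
    {
        "country_code": [1, 34, 230, 230, 1],
        "request_type": ["get_account", "get_account", "modify_account",
                         "get_account", "modify_account"],
        "latency": [134, 256, 589, 43, 1003],
    },
    index=pd.DatetimeIndex(
        ["2013-10-10 13:40:01", "2013-10-10 13:40:59", "2013-10-10 13:41:09",
         "2013-10-10 13:41:12", "2013-10-10 13:53:12"],
        name="timestamp",
    ),
)

# Count requests per country code in 10-minute bins.
counts = df.groupby(["country_code", pd.Grouper(freq="10min")]).size()

# Move country_code into the columns: one column per country code, one shared
# timestamp index. Bins with no requests for a country are filled with 0.
wide = counts.unstack(level=0, fill_value=0)
```

Calling `wide.plot()` (which requires matplotlib) should then draw one line per column against the shared timestamp axis, and `wide.iloc[:, :3].plot()` would limit the graph to the first three country codes.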

Answer 1

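Note: pandas cannot parse the datetime string "2013-10-10-13:40:63", because 63 is not a valid value for seconds (dateutil cannot parse it, and pandas uses dateutil to parse dates). For illustration, I have changed it to "2013-10-10-13:40:59".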
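As an illustration of the parsing step, here is a minimal sketch that loads the sample data with an explicit timestamp format. The format string `%Y-%m-%d-%H:%M:%S`, the variable names, and the use of `errors="coerce"` to show what happens to the invalid value are assumptions made for this example, not part of the original answer.

```python
import io
import pandas as pd

# Sample data with the invalid 13:40:63 already replaced by 13:40:59.
raw = """timestamp, country_code,  request_type,   latency
2013-10-10-13:40:01,  1,    get_account,    134
2013-10-10-13:40:59,  34,   get_account,    256
2013-10-10-13:41:09,  230,  modify_account, 589
2013-10-10-13:41:12,  230,  get_account,    43
2013-10-10-13:53:12,  1,    modify_account, 1003
"""

# skipinitialspace drops the padding after each comma in headers and values.
df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Parse the non-standard "date-time" layout explicitly, then index by it.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d-%H:%M:%S")
df = df.set_index("timestamp")

# With errors="coerce", the original invalid string becomes NaT instead of raising.
bad = pd.to_datetime(
    pd.Series(["2013-10-10-13:40:63"]),
    format="%Y-%m-%d-%H:%M:%S",
    errors="coerce",
)
```

Coercing to NaT makes bad rows easy to find and drop, whereas silently rewriting them (as done by hand here) changes the data.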

1. Number of requests per `country_code` at 10-minute resolution:

In [83]: df
Out[83]:
                     country_code    request_type  latency
timestamp
2013-10-10 13:40:01             1     get_account      134
2013-10-10 13:40:59            34     get_account      256
2013-10-10 13:41:09           230  modify_account      589
2013-10-10 13:41:12           230     get_account       43
2013-10-10 13:53:12             1  modify_account     1003

In [100]: df.groupby('country_code').request_type.resample('10T', how='count')
Out[100]:
country_code  timestamp
1             2013-10-10 13:40:00    1
              2013-10-10 13:50:00    1
34            2013-10-10 13:40:00    1
230           2013-10-10 13:40:00    2
dtype: int64
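The `how=` argument used above was deprecated and later removed from `resample` in newer pandas releases. The following is a hedged sketch of one equivalent for recent versions, using `pd.Grouper`; the inline data construction and the name `counts` are assumptions for illustration.

```python
import pandas as pd

# Same sample frame as shown above, rebuilt so the example is self-contained.
df = pd.DataFrame(
    {
        "country_code": [1, 34, 230, 230, 1],
        "request_type": ["get_account", "get_account", "modify_account",
                         "get_account", "modify_account"],
        "latency": [134, 256, 589, 43, 1003],
    },
    index=pd.DatetimeIndex(
        ["2013-10-10 13:40:01", "2013-10-10 13:40:59", "2013-10-10 13:41:09",
         "2013-10-10 13:41:12", "2013-10-10 13:53:12"],
        name="timestamp",
    ),
)

# Group by country code and by 10-minute bins of the DatetimeIndex,
# then count the rows in each group.
counts = df.groupby(["country_code", pd.Grouper(freq="10min")]).size()
```

Unlike `resample`, this form only returns bins that contain at least one row, so gaps between a group's first and last bin are not filled in.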

2. 99th percentile of `latency` by `request_type` at 1-minute resolution

A very similar approach works here:

In [107]: df.groupby('request_type').latency.resample('T', how=lambda x: x.quantile(0.99))
Out[107]:
request_type    timestamp
get_account     2013-10-10 13:40:00     254.78
                2013-10-10 13:41:00      43.00
modify_account  2013-10-10 13:41:00     589.00
                2013-10-10 13:42:00        NaN
                2013-10-10 13:43:00        NaN
                2013-10-10 13:44:00        NaN
                2013-10-10 13:45:00        NaN
                2013-10-10 13:46:00        NaN
                2013-10-10 13:47:00        NaN
                2013-10-10 13:48:00        NaN
                2013-10-10 13:49:00        NaN
                2013-10-10 13:50:00        NaN
                2013-10-10 13:51:00        NaN
                2013-10-10 13:52:00        NaN
                2013-10-10 13:53:00    1003.00
dtype: float64
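For recent pandas versions, where `how=` is no longer accepted, a sketch of the same percentile query could look like the following. It assumes `pd.Grouper` as a stand-in for the `resample` call above, and the name `p99` is illustrative.

```python
import pandas as pd

# Same sample frame as shown above, rebuilt so the example is self-contained.
df = pd.DataFrame(
    {
        "country_code": [1, 34, 230, 230, 1],
        "request_type": ["get_account", "get_account", "modify_account",
                         "get_account", "modify_account"],
        "latency": [134, 256, 589, 43, 1003],
    },
    index=pd.DatetimeIndex(
        ["2013-10-10 13:40:01", "2013-10-10 13:40:59", "2013-10-10 13:41:09",
         "2013-10-10 13:41:12", "2013-10-10 13:53:12"],
        name="timestamp",
    ),
)

# Group by request type and 1-minute bins, then take the 99th percentile
# of latency within each group (linear interpolation by default).
p99 = df.groupby(["request_type", pd.Grouper(freq="1min")])["latency"].quantile(0.99)
```

This variant omits the empty minutes that appear as NaN rows in the output above, because only bins containing data form a group.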

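3. Number of requests per `country_code` and `request_type` at 10-minute resolution

This is essentially the same as #1, except that you add an additional group in the call to DataFrame.groupby: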

In [108]: df.groupby(['country_code', 'request_type']).request_type.resample('10T', how='count')
Out[108]:
country_code  request_type    timestamp
1             get_account     2013-10-10 13:40:00    1
              modify_account  2013-10-10 13:50:00    1
34            get_account     2013-10-10 13:40:00    1
230           get_account     2013-10-10 13:40:00    1
              modify_account  2013-10-10 13:40:00    1
dtype: int64
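A sketch of the two-dimension grouping for recent pandas versions, again assuming `pd.Grouper` in place of the removed `how=` argument; the name `counts2` is illustrative.

```python
import pandas as pd

# Same sample frame as shown above, rebuilt so the example is self-contained.
df = pd.DataFrame(
    {
        "country_code": [1, 34, 230, 230, 1],
        "request_type": ["get_account", "get_account", "modify_account",
                         "get_account", "modify_account"],
        "latency": [134, 256, 589, 43, 1003],
    },
    index=pd.DatetimeIndex(
        ["2013-10-10 13:40:01", "2013-10-10 13:40:59", "2013-10-10 13:41:09",
         "2013-10-10 13:41:12", "2013-10-10 13:53:12"],
        name="timestamp",
    ),
)

# Two column keys plus a 10-minute time bin; any number of columns
# can be listed before the Grouper.
counts2 = df.groupby(
    ["country_code", "request_type", pd.Grouper(freq="10min")]
).size()
```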

It is not yet clear what you are asking for beyond this; please elaborate.

Answer 1 details: score 6, accepted, posted 2013-10-10 17:55:54
