Custom rolling_apply function in Python pandas

Question

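Setup

I have a DataFrame with three columns:

"Category" contains True and False, and I have already grouped by these values with df.groupby('Category').
"Time" contains the timestamps (in seconds) at which values were recorded.
"Value" contains the values themselves.

At each time instance, two values are recorded: one with category "True" and the other with category "False".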

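Rolling apply question

Within each category group, I want to compute a number and store it in a Result column for each time. Result is the percentage of values recorded between time t-60 and t that fall between 1 and 3.

The easiest way to accomplish this is probably to compute the total number of values in that time interval with rolling_count, then use rolling_apply to count only the values in that interval that fall between 1 and 3.

Here is my code so far: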

groups = df.groupby(['Category'])
for key, grp in groups:
    grp = grp.reindex(grp['Time']) # reindex by time so we can count with rolling windows
    grp['total'] = pd.rolling_count(grp['Value'], window=60) # count number of values in the last 60 seconds
    grp['in_interval'] = ? ## Need to count number of values where 1<v<3 in the last 60 seconds

    grp['Result'] = grp['in_interval'] / grp['total'] # percentage of values between 1 and 3 in the last 60 seconds

What is the correct rolling_apply() call to find grp['in_interval']?

Answer 1

Let's work through an example:

import pandas as pd
import numpy as np
np.random.seed(1)

def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True]*N + [False]*N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a,b))
        })
    return df

df = setup(regular=False)
df.sort_values(['Category', 'Time'], inplace=True)

So the DataFrame, df, looks like this:

In [4]: df
Out[4]: 
   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1
2      True   0.013725      2
5      True  11.080631      5
4      True  17.610707      4
6      True  22.351225      6
3      True  36.279909      3
7      True  41.467287      7
8      True  47.612097      8
0      True  50.042641      0
9      True  64.658008      9
1      True  86.438939      1

Now, copying from @herrfz, let's define

def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage

between(1,3) is a function that takes a Series as input and returns the fraction of its elements that lie in the half-open interval [1,3). For example,

In [9]: series = pd.Series([1,2,3,4,5])

In [10]: between(1,3)(series)
Out[10]: 0.4

Now we take our DataFrame, df, and group it by Category:

df.groupby(['Category'])

For each group in the groupby object, we want to apply a function:

df['Result'] = df.groupby(['Category']).apply(toeach_category)

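The function toeach_category takes a (sub)DataFrame as input and returns a DataFrame as output. The entire result is assigned to a new column of df called Result.

Now what exactly must toeach_category do? If we write toeach_category like this: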

def toeach_category(subf):
    print(subf)

then we see that each subf is a DataFrame such as this one (when Category is False):

   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1

We want to take the Time column and apply a function to each time. That is done with applymap:

def toeach_category(subf):
    result = subf[['Time']].applymap(percentage)

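The function percentage takes a time value as input and returns a value as output. That value is the fraction of rows whose Value lies between 1 and 3. applymap is very strict: percentage cannot take any other arguments.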

Given a time t, we can use loc to select from subf the Values whose times lie in the half-open interval (t-60, t]:

subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value']

So we can find the fraction of those Values that lie between 1 and 3 by applying between(1,3):

between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])

Now remember that we want a function percentage that takes t as input and returns the expression above as output:

def percentage(t):
    return between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])

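But notice that percentage depends on subf, and we are not allowed to pass subf to percentage as an argument (again, because applymap is very strict).

So how do we get out of this jam? The solution is to define percentage inside toeach_category. Python's scoping rules say that a bare name like subf is looked up first in the Local scope, then in the Enclosing scope, then in the Global scope, and finally in the Builtin scope. When percentage(t) is called and Python encounters subf, it first looks in the Local scope for the value of subf. Since subf is not a local variable of percentage, Python looks for it in the Enclosing scope of the function toeach_category. It finds subf there. Perfect. That is just what we need.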
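
The scoping rule described above can be seen in isolation with a minimal sketch. The names make_window_counter and count_in_window are hypothetical and chosen only for illustration; they are not part of the answer's code.

```python
def make_window_counter(times, width=60):
    """Return a one-argument function that still has access to `times`."""
    def count_in_window(t):
        # `times` and `width` are not local to count_in_window, so Python
        # finds them in the enclosing scope of make_window_counter.
        return sum(1 for s in times if t - width < s <= t)
    return count_in_window


count = make_window_counter([0, 10, 30, 65, 80])
print(count(65))  # number of times in the half-open interval (5, 65]
```

The inner function takes only t, yet it can still read the data it needs, which is the same arrangement that lets percentage reach subf.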

So now we have our function toeach_category:

def toeach_category(subf):
    def percentage(t):
        return between(1, 3)(
            subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    result = subf[['Time']].applymap(percentage)
    return result

Putting it all together,

import pandas as pd
import numpy as np
np.random.seed(1)


def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True] * N + [False] * N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a, b))
    })
    return df


def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage


def toeach_category(subf):
    def percentage(t):
        return between(1, 3)(
            subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    result = subf[['Time']].applymap(percentage)
    return result


df = setup(regular=False)
df.sort_values(['Category', 'Time'], inplace=True)
df['Result'] = df.groupby(['Category']).apply(toeach_category)
print(df)

yields

   Category       Time  Value    Result
12    False   0.013725      2  1.000000
15    False  11.080631      5  0.500000
14    False  17.610707      4  0.333333
16    False  22.351225      6  0.250000
13    False  36.279909      3  0.200000
17    False  41.467287      7  0.166667
18    False  47.612097      8  0.142857
10    False  50.042641      0  0.125000
19    False  64.658008      9  0.000000
11    False  86.438939      1  0.166667
2      True   0.013725      2  1.000000
5      True  11.080631      5  0.500000
4      True  17.610707      4  0.333333
6      True  22.351225      6  0.250000
3      True  36.279909      3  0.200000
7      True  41.467287      7  0.166667
8      True  47.612097      8  0.142857
0      True  50.042641      0  0.125000
9      True  64.658008      9  0.000000
1      True  86.438939      1  0.166667
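
Several calls used in this thread (pd.rolling_count, pd.rolling_apply, DataFrame.sort, the .ix indexer) were removed in later pandas releases. As a rough sketch of how the same quantity might be computed there, the block below swaps in a different technique: a time-based window from Series.rolling on a timedelta index, rather than the applymap approach above. The helper name rolling_fraction and the small demo frame are hypothetical, and the sketch assumes the timestamps within each category are distinct.

```python
import pandas as pd


def rolling_fraction(df, low=1, high=3, window="60s"):
    """Fraction of Values in [low, high) within (t - window, t], per Category."""
    pieces = []
    for _, grp in df.groupby("Category"):
        grp = grp.sort_values("Time")
        # 1.0 where the value lies in [low, high), else 0.0
        hits = ((grp["Value"] >= low) & (grp["Value"] < high)).astype(float)
        # Index by elapsed time so rolling() can use a time-based window
        hits.index = pd.to_timedelta(grp["Time"].to_numpy(), unit="s")
        frac = hits.rolling(window).mean()
        # Restore the original row labels so the result aligns with df
        frac.index = grp.index
        pieces.append(frac)
    return pd.concat(pieces)


# Hypothetical demo data, not the data from the answer above
demo = pd.DataFrame({
    "Category": [True] * 5 + [False] * 5,
    "Time": [0, 10, 30, 65, 80] * 2,
    "Value": [2, 5, 1, 4, 2] * 2,
})
demo["Result"] = rolling_fraction(demo)
print(demo)
```

The mean of a 0/1 indicator over the window is the fraction of hits, so no separate count is needed. A time-based window is closed on the right by default, which matches the half-open interval (t-60, t] used above.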

Answer 2

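If I understand your problem statement correctly, you could probably skip rolling_count if you only use it to compute the percentage. rolling_apply takes as an argument a function that performs an aggregation, i.e. a function that takes an array as input and returns a number as output.

With this in mind, let's first define a function: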

def between_1_3_perc(x):
    # pandas Series is basically a numpy array, we can do boolean indexing
    return float(len(x[(x > 1) & (x < 3)])) / float(len(x))

Then use the function name as the argument of rolling_apply in the for loop:

grp['Result'] = pd.rolling_apply(grp['Value'], 60, between_1_3_perc)
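
Note that the window argument of rolling_apply counts observations, so a window of 60 spans 60 seconds only if the values arrive one second apart. As a hedged sketch for pandas releases where pd.rolling_apply is no longer available, the same fixed-size aggregation can be written with Series.rolling(...).apply. The window of 3 rows and min_periods=1 are assumptions made to keep the example small; the answer uses a window of 60.

```python
import pandas as pd


def between_1_3_perc(x):
    # x is a NumPy array holding the values of one window (raw=True below)
    return float(((x > 1) & (x < 3)).sum()) / len(x)


values = pd.Series([2, 5, 1, 4, 2])
# window=3 rows here to keep the example small; the answer uses 60
result = values.rolling(3, min_periods=1).apply(between_1_3_perc, raw=True)
print(result.tolist())
```

With min_periods=1 the first windows are shorter than 3 rows; leaving it out yields NaN until a full window is available.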

2 answers

Answer 1: score 7, accepted, 2013-03-18 21:52:32

Answer 2: score 2, 2013-03-18 21:55:56
