Customizing rolling_apply function in Python pandas

Setup

I have a DataFrame with three columns:

  • "Category" contains True and False, and I have done df.groupby('Category') to group by these values.
  • "Time" contains timestamps (measured in seconds) at which values have been recorded.
  • "Value" contains the values themselves.

At each time instance, two values are recorded: one has category "True", and the other has category "False".

Rolling apply question

Within each category group, I want to compute a number and store it in column Result for each time. Result is the percentage of values between time t-60 and t that fall between 1 and 3.

The easiest way to accomplish this is probably to calculate the total number of values in that time interval via rolling_count, then execute rolling_apply to count only the values from that interval that fall between 1 and 3.

Here is my code so far:

groups = df.groupby(['Category'])
for key, grp in groups:
    grp = grp.reindex(grp['Time']) # reindex by time so we can count with rolling windows
    grp['total'] = pd.rolling_count(grp['Value'], window=60) # count number of values in the last 60 seconds
    grp['in_interval'] = ? ## Need to count number of values where 1<v<3 in the last 60 seconds

    grp['Result'] = grp['in_interval'] / grp['total'] # percentage of values between 1 and 3 in the last 60 seconds

What is the proper rolling_apply() call to find grp['in_interval']?

Let's work through an example:

import pandas as pd
import numpy as np
np.random.seed(1)

def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True]*N + [False]*N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a,b))
        })
    return df

df = setup(regular=False)
# DataFrame.sort was removed from later pandas releases; sort_values replaces it
df.sort_values(['Category', 'Time'], inplace=True)

So the DataFrame, df, looks like this:

In [4]: df
Out[4]: 
   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1
2      True   0.013725      2
5      True  11.080631      5
4      True  17.610707      4
6      True  22.351225      6
3      True  36.279909      3
7      True  41.467287      7
8      True  47.612097      8
0      True  50.042641      0
9      True  64.658008      9
1      True  86.438939      1

Now, copying @herrfz, let's define

def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage

between(1,3) is a function which takes a Series as input and returns the fraction of its elements which lie in the half-open interval [1,3). For example,

In [9]: series = pd.Series([1,2,3,4,5])

In [10]: between(1,3)(series)
Out[10]: 0.4
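
As a cross-check, the same fraction can be computed without the helper. This is only a sketch for comparison, assuming pandas 1.3 or newer, where Series.between accepts inclusive='left'; the names mask and frac are chosen here and are not part of the answer.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

# Boolean mask for the half-open interval [1, 3).
mask = series.between(1, 3, inclusive='left')

# The mean of a boolean mask is the fraction of True entries.
frac = mask.mean()
print(frac)
```

Only 1 and 2 fall in [1, 3), so the fraction agrees with between(1,3)(series).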

Now we are going to take our DataFrame, df, and group by Category:

df.groupby(['Category'])

For each group in the groupby object, we will want to apply a function:

# group_keys=False keeps the original row index so the result aligns with df
df['Result'] = df.groupby(['Category'], group_keys=False).apply(toeach_category)

The function, toeach_category, will take a (sub)DataFrame as input, and return a DataFrame as output. The entire result will be assigned to a new column of df called Result.

Now what exactly must toeach_category do? If we write toeach_category like this:

def toeach_category(subf):
    print(subf)

then we see each subf is a DataFrame such as this one (when Category is False):

   Category       Time  Value
12    False   0.013725      2
15    False  11.080631      5
14    False  17.610707      4
16    False  22.351225      6
13    False  36.279909      3
17    False  41.467287      7
18    False  47.612097      8
10    False  50.042641      0
19    False  64.658008      9
11    False  86.438939      1
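
Another way to look at the sub-DataFrames, without printing from inside apply, is to iterate over the groupby object directly. The sketch below uses a small made-up frame (toy) with the same three columns; the data and the sizes dictionary are illustrative assumptions only.

```python
import pandas as pd

# Small made-up frame with the same three columns as df.
toy = pd.DataFrame({'Category': [True, True, False, False],
                    'Time': [0.0, 10.0, 0.0, 10.0],
                    'Value': [1, 2, 3, 4]})

# Iterating a groupby object yields (key, sub-DataFrame) pairs.
sizes = {}
for key, subf in toy.groupby('Category'):
    sizes[key] = len(subf)
    print(key, subf['Value'].tolist())
```

Each subf holds only the rows of one category, which is exactly what toeach_category receives.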

We want to take the Time column and, for each time, apply a function. That's done with applymap:

def toeach_category(subf):
    result = subf[['Time']].applymap(percentage)

The function percentage will take a time value as input, and return a value as output. The value will be the fraction of rows with values between 1 and 3. applymap is very strict: percentage cannot take any other arguments.

Given a time t, we can select the Values from subf whose times are in the half-open interval (t-60, t] using the loc indexer (the ix indexer used in older pandas has since been removed):

subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value']
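
A small self-contained sketch of this selection, using made-up data and the loc indexer (assumed here because ix is unavailable in current pandas). The names in_window and window_values are hypothetical helpers for illustration.

```python
import pandas as pd

# Toy stand-in for one category group (values are made up for illustration).
subf = pd.DataFrame({'Time': [0.0, 30.0, 70.0, 100.0],
                     'Value': [2, 5, 1, 7]})
t = 100.0

# Rows whose Time lies in the half-open interval (t - 60, t].
in_window = (t - 60 < subf['Time']) & (subf['Time'] <= t)
window_values = subf.loc[in_window, 'Value']
print(window_values.tolist())  # values recorded at times 70 and 100
```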

And so we can find the percentage of those Values between 1 and 3 by applying between(1,3):

between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])

Now remember that we want a function percentage which takes t as input and returns the above expression as output:

def percentage(t):
    return between(1,3)(subf.loc[(t-60 < subf['Time']) & (subf['Time'] <= t), 'Value'])

But notice that percentage depends on subf, and we are not allowed to pass subf to percentage as an argument (again, because applymap is very strict).

So how do we get out of this jam? The solution is to define percentage inside toeach_category. Python's scoping rules say that a bare name like subf is first looked for in the Local scope, then the Enclosing scope, then the Global scope, and lastly in the Builtin scope. When percentage(t) is called, and Python encounters subf, Python first looks in the Local scope for the value of subf. Since subf is not a local variable in percentage, Python looks for it in the Enclosing scope of the function toeach_category. It finds subf there. Perfect. That is just what we need.
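
The scoping rule can be seen in isolation with a few lines of plain Python. This is a minimal sketch; make_counter_above and count_above are hypothetical names invented for the illustration and have nothing to do with pandas.

```python
def make_counter_above(values):
    # `values` is a local variable of make_counter_above ...
    def count_above(threshold):
        # ... and is found here through the Enclosing scope, even though
        # count_above itself only receives `threshold`.
        return sum(1 for v in values if v > threshold)
    return count_above


count_above = make_counter_above([1, 4, 6, 9])
print(count_above(5))  # 6 and 9 are above 5
```

The inner function takes a single argument, just as applymap requires, yet it can still read data owned by the outer function.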

So now we have our function toeach_category:

def toeach_category(subf):
    def percentage(t):
        return between(1, 3)(
            subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    result = subf[['Time']].applymap(percentage)
    return result

Putting it all together,

import pandas as pd
import numpy as np
np.random.seed(1)


def setup(regular=True):
    N = 10
    x = np.arange(N)
    a = np.arange(N)
    b = np.arange(N)

    if regular:
        timestamps = np.linspace(0, 120, N)
    else:
        timestamps = np.random.uniform(0, 120, N)

    df = pd.DataFrame({
        'Category': [True] * N + [False] * N,
        'Time': np.hstack((timestamps, timestamps)),
        'Value': np.hstack((a, b))
    })
    return df


def between(a, b):
    def between_percentage(series):
        return float(len(series[(a <= series) & (series < b)])) / float(len(series))
    return between_percentage


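def toeach_category(subf):
    def percentage(t):
        # fraction of Values in [1, 3) among rows with Time in (t - 60, t]
        return between(1, 3)(
            subf.loc[(t - 60 < subf['Time']) & (subf['Time'] <= t), 'Value'])
    # on pandas >= 2.1, DataFrame.map is the newer name for applymap
    result = subf[['Time']].applymap(percentage)
    return result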


df = setup(regular=False)
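df.sort_values(['Category', 'Time'], inplace=True)
# group_keys=False keeps the original row index so the result aligns with df
df['Result'] = df.groupby(['Category'], group_keys=False).apply(toeach_category)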
print(df)

yields

   Category       Time  Value    Result
12    False   0.013725      2  1.000000
15    False  11.080631      5  0.500000
14    False  17.610707      4  0.333333
16    False  22.351225      6  0.250000
13    False  36.279909      3  0.200000
17    False  41.467287      7  0.166667
18    False  47.612097      8  0.142857
10    False  50.042641      0  0.125000
19    False  64.658008      9  0.000000
11    False  86.438939      1  0.166667
2      True   0.013725      2  1.000000
5      True  11.080631      5  0.500000
4      True  17.610707      4  0.333333
6      True  22.351225      6  0.250000
3      True  36.279909      3  0.200000
7      True  41.467287      7  0.166667
8      True  47.612097      8  0.142857
0      True  50.042641      0  0.125000
9      True  64.658008      9  0.000000
1      True  86.438939      1  0.166667
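
One row of this output can be verified by hand. The sketch below recomputes Result for the row at Time 36.279909 in plain Python, with the times and values copied from the table above; the variable names are chosen here for illustration.

```python
# First five rows of one category, copied from the output above.
times = [0.013725, 11.080631, 17.610707, 22.351225, 36.279909]
values = [2, 5, 4, 6, 3]
t = 36.279909

# Values recorded in the half-open time interval (t - 60, t].
window = [v for tm, v in zip(times, values) if t - 60 < tm <= t]

# Of those, the ones in the half-open value interval [1, 3).
hits = [v for v in window if 1 <= v < 3]

fraction = len(hits) / len(window)
print(fraction)  # 0.2
```

All five rows fall inside the 60-second window and only the value 2 lies in [1, 3), giving 1/5 = 0.2, which matches the table.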

If I understand your problem statement correctly, you could probably skip rolling_count if you use it only for the sake of computing the percentage. rolling_apply takes as an argument a function that performs aggregation, i.e. a function that takes an array as input and returns a number as an output.

Having this in mind, let's first define a function:

def between_1_3_perc(x):
    # pandas Series is basically a numpy array, we can do boolean indexing
    return float(len(x[(x > 1) & (x < 3)])) / float(len(x))

Then use the function name as an argument of rolling_apply in the for-loop:

grp['Result'] = pd.rolling_apply(grp['Value'], 60, between_1_3_perc)
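
Note that pd.rolling_apply was removed from later pandas releases in favour of the .rolling() method, and that window=60 counts 60 rows rather than 60 seconds. A rough modern sketch, assuming a recent pandas and made-up data for a single category, converts the times to a DatetimeIndex so that a '60s' window covers the interval (t - 60 s, t]. The toy frame and the np.mean rewrite of between_1_3_perc are assumptions for illustration, not the original author's code.

```python
import numpy as np
import pandas as pd


def between_1_3_perc(x):
    # With raw=True, x is a NumPy array holding the values of one window.
    return np.mean((x > 1) & (x < 3))


# Toy data for a single category; times are in seconds (made up here).
grp = pd.DataFrame({'Time': [0.0, 30.0, 70.0, 100.0],
                    'Value': [2, 5, 7, 2]})

# A time-based window needs a sorted, datetime-like index.
index = pd.DatetimeIndex(pd.to_datetime(grp['Time'], unit='s'))
values = pd.Series(grp['Value'].to_numpy(), index=index)

# For each row at time t, '60s' selects the rows in (t - 60 s, t].
result = values.rolling('60s').apply(between_1_3_perc, raw=True)
grp['Result'] = result.to_numpy()
print(grp)
```

Because the window is defined in time rather than in rows, this also copes with irregularly spaced timestamps.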
