Rolling average based on another column
I have a dataframe df which looks like this:
time(float) | value (float) |
---|---|
10.45 | 10 |
10.50 | 20 |
10.55 | 25 |
11.20 | 30 |
11.44 | 20 |
12.30 | 30 |
I need help calculating a new column called rolling_average_value, which is the average of the value in that row and all values from the hour before that row, so that the new dataframe looks like this:
time(float) | value (float) | rolling_average_value |
---|---|---|
10.45 | 10 | 10 |
10.50 | 20 | 15 |
10.55 | 25 | 18.33 |
11.20 | 30 | 21.25 |
11.44 | 20 | 21 |
12.30 | 30 | 25 |
Note: The time column is a float column.
You can temporarily set a datetime index and apply rolling.mean:
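# extract hours/minutes from the float (encoded as HH.MM)
import numpy as np
import pandas as pd
minutes, hours = np.modf(df['time(float)'])
hours = hours.astype(int)
# round before casting: the fractional part times 100 can land just below
# a whole number (e.g. 44.99...), which astype(int) would truncate
minutes = minutes.mul(100).round().astype(int)
# zero-pad both parts so that e.g. 9.05 becomes '0905' rather than '95'
dt = pd.to_datetime(hours.astype(str).str.zfill(2)
                    + minutes.astype(str).str.zfill(2), format='%H%M')
# perform the rolling computation on a temporary datetime index
df['rolling_mean'] = (df.set_axis(dt)
                        .rolling('1h')['value (float)']
                        .mean()
                        .set_axis(df.index)
                      )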
Output:
time(float) value (float) rolling_mean
0 10.45 10 10.000000
1 10.50 20 15.000000
2 10.55 25 18.333333
3 11.20 30 21.250000
4 11.44 20 21.000000
5 12.30 30 25.000000
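For reference, here is a self-contained sketch of the same idea that avoids string parsing altogether. It assumes the floats encode clock times as HH.MM (so 10.45 means 10:45) and that the rows are sorted by time. The anchor date 1970-01-01 and the names hhmm and minutes_of_day are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time(float)": [10.45, 10.50, 10.55, 11.20, 11.44, 12.30],
    "value (float)": [10, 20, 25, 30, 20, 30],
})

# Assumption: the float is HH.MM, so scale to an integer HHMM and round
# before casting to avoid truncating values that sit just below a whole number.
hhmm = df["time(float)"].mul(100).round().astype(int)
minutes_of_day = (hhmm // 100) * 60 + hhmm % 100

# Attach an arbitrary date so that pandas can use a time-based window.
dt = pd.Timestamp("1970-01-01") + pd.to_timedelta(minutes_of_day, unit="min")

# Roll over the trailing hour, then write the result back positionally.
s = pd.Series(df["value (float)"].to_numpy(), index=pd.DatetimeIndex(dt))
df["rolling_average_value"] = s.rolling("1h").mean().to_numpy()

print(df)
```

With the default settings, the '1h' window includes the current row and excludes a row that lies exactly one hour earlier.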
Alternative way to compute dt:
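# format with two decimals and a zero-padded hour, so that 10.5 becomes '10.50'
# (a plain astype(str) would give '10.5', which cannot be told apart from 10:05)
dt = pd.to_datetime(df['time(float)'].map('{:05.2f}'.format),
                    format='%H.%M')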
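When going through strings, it is worth checking how the floats are rendered before they are parsed. The following is a small sanity check, assuming the HH.MM encoding used in the question; the sample values and the names as_text and parsed are made up for illustration.

```python
import pandas as pd

times = pd.Series([9.05, 10.5, 11.2])

# str() drops trailing zeros, so the minutes of 10.50 would read as '5'.
print(str(10.5))

# Fixed-width formatting keeps two minute digits and pads the hour.
as_text = times.map("{:05.2f}".format)
parsed = pd.to_datetime(as_text, format="%H.%M")

print(as_text.tolist())
print(parsed.dt.strftime("%H:%M").tolist())
```

The parsed values carry pandas' default date (1900-01-01), which is harmless here because only the differences between times matter for the rolling window.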
Assuming your data frame is sorted by time, you can also use a simple list comprehension to solve your problem. Iterate over times and, for each row, collect the indices of the current and earlier rows whose time lies less than 1 (that is, less than one hour) before the current value. Use those indices to slice the value column, which has been converted to an array. Then, you can just compute the mean of the sliced array:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{"time": [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
"value": [10, 20, 25, 30, 20, 30]}
)
times = df["time"].values
values = df["value"].values
df["rolling_mean"] = [round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2) for i in range(len(times))]
If your data frame is large, you can JIT-compile this loop to machine code with Numba to make it significantly faster:
from numba import njit

@njit
def compute_rolling_mean(times, values):
    return [round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2) for i in range(len(times))]

df["rolling_mean"] = compute_rolling_mean(df["time"].values, df["value"].values)
Output:
time value rolling_mean
0 10.45 10 10.00
1 10.50 20 15.00
2 10.55 25 18.33
3 11.20 30 21.25
4 11.44 20 21.00
5 12.30 30 25.00
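If neither the loop nor Numba is an option, the same trailing window can be computed without a Python-level loop. The sketch below swaps in a different technique from the answer above: prefix sums combined with a binary search (np.searchsorted). It assumes the times are sorted and encode HH.MM; the function name rolling_mean_sorted and its window_minutes parameter are hypothetical.

```python
import numpy as np

def rolling_mean_sorted(hhmm_times, values, window_minutes=60):
    """Mean of each value and all earlier values less than one window before it."""
    # Assumption: times are HH.MM floats; convert them to integer minutes of the day.
    t = np.rint(np.asarray(hhmm_times, dtype=float) * 100).astype(np.int64)
    minutes = (t // 100) * 60 + t % 100
    values = np.asarray(values, dtype=float)

    # left[i] is the first row strictly inside the window (t_i - window, t_i].
    left = np.searchsorted(minutes, minutes - window_minutes, side="right")
    right = np.arange(1, len(minutes) + 1)

    # Prefix sums turn every window sum into a single subtraction.
    csum = np.concatenate(([0.0], np.cumsum(values)))
    return (csum[right] - csum[left]) / (right - left)

result = rolling_mean_sorted(
    [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
    [10, 20, 25, 30, 20, 30],
)
print(np.round(result, 2))
```

Working in integer minutes also sidesteps floating-point comparisons such as times[i] - times[:i+1] < 1, which can be sensitive to rounding when two rows are exactly one hour apart.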