Conditional mean and sum of previous N rows in pandas dataframe
Consider this example pandas dataframe:
Measurement Trigger Valid
0 2.0 False True
1 4.0 False True
2 3.0 False True
3 0.0 True False
4 100.0 False True
5 3.0 False True
6 2.0 False True
7 1.0 True True
Whenever Trigger is True, I wish to calculate the sum and mean of the last 3 valid measurements (starting from the current row). A measurement is considered valid if the column Valid is True. Let's clarify using the two examples in the above dataframe:
Index 3: Indices 2, 1, 0 should be used. Expected Sum = 9.0, Mean = 3.0
Index 7: Indices 7, 6, 5 should be used. Expected Sum = 6.0, Mean = 2.0
I have tried pandas.rolling and creating new, shifted columns, but was not successful. See the following excerpt from my tests (which should run directly):
import unittest
import pandas as pd
import numpy as np
from pandas.testing import assert_series_equal


def create_sample_dataframe_2():
    df = pd.DataFrame(
        {"Measurement": [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0],
         "Valid": [True, True, True, False, True, True, True, True],
         "Trigger": [False, False, False, True, False, False, False, True],
         })
    return df


def expected_result():
    return pd.DataFrame({"Sum": [np.nan, np.nan, np.nan, 9.0, np.nan, np.nan, np.nan, 6.0],
                         "Mean": [np.nan, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan, 2.0]})


class Data_Preparation_Functions(unittest.TestCase):

    def test_backsummation(self):
        # Attempt 1: build shifted copies of Measurement and aggregate across them
        N_SUMMANDS = 3
        temp_vars = []
        df = create_sample_dataframe_2()
        for i in range(0, N_SUMMANDS):
            temp_var = "M_{0}".format(i)
            df[temp_var] = df["Measurement"].shift(i)
            temp_vars.append(temp_var)
        df["Sum"] = df[temp_vars].sum(axis=1)
        df["Mean"] = df[temp_vars].mean(axis=1)
        # Keep results only on rows where Trigger is True
        df.loc[(df["Trigger"] == False), "Sum"] = np.nan
        df.loc[(df["Trigger"] == False), "Mean"] = np.nan
        assert_series_equal(expected_result()["Sum"], df["Sum"])
        assert_series_equal(expected_result()["Mean"], df["Mean"])

    def test_rolling(self):
        # Attempt 2: rolling window over the valid rows only
        df = create_sample_dataframe_2()
        df["Sum"] = df[(df["Valid"] == True)]["Measurement"].rolling(window=3).sum()
        df["Mean"] = df[(df["Valid"] == True)]["Measurement"].rolling(window=3).mean()
        df.loc[(df["Trigger"] == False), "Sum"] = np.nan
        df.loc[(df["Trigger"] == False), "Mean"] = np.nan
        assert_series_equal(expected_result()["Sum"], df["Sum"])
        assert_series_equal(expected_result()["Mean"], df["Mean"])


if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(Data_Preparation_Functions)
    unittest.TextTestRunner(verbosity=2).run(suite)
Any help or solution is greatly appreciated. Thanks and cheers!
EDIT: Clarification: This is the resulting dataframe I expect:
Measurement Trigger Valid Sum Mean
0 2.0 False True NaN NaN
1 4.0 False True NaN NaN
2 3.0 False True NaN NaN
3 0.0 True False 9.0 3.0
4 100.0 False True NaN NaN
5 3.0 False True NaN NaN
6 2.0 False True NaN NaN
7 1.0 True True 6.0 2.0
EDIT2: Another clarification:
I did not miscalculate; rather, I did not make my intentions as clear as I could have. Here's another try using the same dataframe:
Let's first look at the Trigger column: we find the first True at index 3. So index 3 is the point where we start looking. There is no valid measurement at index 3 (column Valid is False). So we go further back in time until we have accumulated three rows where Valid is True. This happens for indices 2, 1 and 0. For these three indices, we calculate the sum and mean of the column Measurement: Sum = 9.0, Mean = 3.0.
Now we start the next iteration of this little algorithm: look again for the next True in the Trigger column. We find it at index 7. There is also a valid measurement at index 7, so we include it this time. For our calculation, we use indices 7, 6 and 5, and thus get Sum = 6.0, Mean = 2.0.
I hope this sheds more light on this little problem.
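The walk-back procedure described above can be written out as a plain loop, which may be useful as a reference to check faster solutions against. This is only a minimal sketch: the function name `backward_valid_stats` is hypothetical, and leaving NaN when fewer than `n` valid rows are available is an assumption that the question does not spell out.

```python
import numpy as np
import pandas as pd


def backward_valid_stats(df, n=3):
    """Hypothetical helper: for each Trigger row, walk backwards (current row
    included) and aggregate the first n rows where Valid is True."""
    sums = np.full(len(df), np.nan)
    means = np.full(len(df), np.nan)
    measurement = df["Measurement"].to_numpy()
    valid = df["Valid"].to_numpy()
    trigger = df["Trigger"].to_numpy()

    for i in np.flatnonzero(trigger):
        picked = []
        j = i
        # Go back in time until n valid measurements have been collected
        while j >= 0 and len(picked) < n:
            if valid[j]:
                picked.append(measurement[j])
            j -= 1
        # Assumption: stay NaN if fewer than n valid rows exist
        if len(picked) == n:
            sums[i] = sum(picked)
            means[i] = sums[i] / n

    return pd.DataFrame({"Sum": sums, "Mean": means}, index=df.index)


df = pd.DataFrame({
    "Measurement": [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0],
    "Valid": [True, True, True, False, True, True, True, True],
    "Trigger": [False, False, False, True, False, False, False, True],
})
result = backward_valid_stats(df)
print(df.join(result))
```

The loop is slow on large frames, but it follows the description step by step, so its output is easy to reason about.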
Here's an option: take the 3-period rolling mean and sum:
df['RollM'] = df.Measurement.rolling(window=3,min_periods=0).mean()
df['RollS'] = df.Measurement.rolling(window=3,min_periods=0).sum()
Now set the rows where Trigger is False to NaN:
df.loc[df.Trigger == False,['RollS','RollM']] = np.nan
which yields:
Measurement Trigger Valid RollM RollS
0 2.0 False True NaN NaN
1 4.0 False True NaN NaN
2 3.0 False True NaN NaN
3 0.0 True False 2.333333 7.0
4 100.0 False True NaN NaN
5 3.0 False True NaN NaN
6 2.0 False True NaN NaN
7 1.0 True True 2.000000 6.0
Edit: updated to take the Valid column into account:
df['mean'],df['sum'] = np.nan,np.nan
roller = df.Measurement.rolling(window=3,min_periods=0).agg(['mean','sum'])
df.loc[(df.Trigger == True) & (df.Valid == True),['mean','sum']] = roller
df.loc[(df.Trigger == True) & (df.Valid == False),['mean','sum']] = roller.shift(1)
which yields:
Measurement Trigger Valid mean sum
0 2.0 False True NaN NaN
1 4.0 False True NaN NaN
2 3.0 False True NaN NaN
3 0.0 True False 3.0 9.0
4 100.0 False True NaN NaN
5 3.0 False True NaN NaN
6 2.0 False True NaN NaN
7 1.0 True True 2.0 6.0
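Note that a rolling window over all rows includes any invalid measurement that falls inside it; for example, a trigger at index 4 would see the window 3.0, 0.0, 100.0. One way that might cover such cases is to roll over the valid rows only and then carry each result forward onto the rows that were filtered out. The following is a sketch under that assumption, not a tested general solution; the column names `Sum` and `Mean` follow the expected output in the question, and `N` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Measurement": [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0],
    "Valid": [True, True, True, False, True, True, True, True],
    "Trigger": [False, False, False, True, False, False, False, True],
})

N = 3  # number of valid measurements to aggregate

# Rolling statistics computed on the valid rows only (index 3 is absent here)
valid_stats = (
    df.loc[df["Valid"], "Measurement"]
    .rolling(window=N)
    .agg(["sum", "mean"])
)

# Bring back the full index; an invalid row inherits the result of the
# most recent valid row before it
aligned = valid_stats.reindex(df.index).ffill()

# Keep the values only where Trigger is True, NaN elsewhere
df["Sum"] = aligned["sum"].where(df["Trigger"])
df["Mean"] = aligned["mean"].where(df["Trigger"])
print(df)
```

Without the forward fill, index 3 would stay NaN, because the filtered series has no row at that index to align with.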