
Conditional mean and sum of previous N rows in pandas dataframe

Consider this example pandas dataframe:

      Measurement  Trigger  Valid
   0          2.0    False   True
   1          4.0    False   True
   2          3.0    False   True
   3          0.0     True  False
   4        100.0    False   True
   5          3.0    False   True
   6          2.0    False   True
   7          1.0     True   True

Whenever Trigger is True, I wish to calculate the sum and mean of the last 3 valid measurements, counting back from (and including) the current row. A measurement is considered valid if the column Valid is True. Let's clarify using the two examples in the above dataframe:

  1. Index 3: Indices 2, 1, 0 should be used. Expected Sum = 9.0, Mean = 3.0
  2. Index 7: Indices 7, 6, 5 should be used. Expected Sum = 6.0, Mean = 2.0

I have tried pandas.rolling and creating new, shifted columns, but was not successful. See the following excerpt from my tests (which should run directly):

import unittest
import pandas as pd
import numpy as np
from pandas.testing import assert_series_equal

def create_sample_dataframe_2():
    df = pd.DataFrame(
        {"Measurement" : [2.0,   4.0,   3.0,   0.0,   100.0, 3.0,   2.0,   1.0 ],
         "Valid"       : [True,  True,  True,  False, True,  True,  True,  True],
         "Trigger"     : [False, False, False, True,  False, False, False, True],
         })
    return df

def expected_result():
    return pd.DataFrame({"Sum" : [np.nan, np.nan, np.nan, 9.0, np.nan, np.nan, np.nan, 6.0],
                         "Mean" :[np.nan, np.nan, np.nan, 3.0, np.nan, np.nan, np.nan, 2.0]})

class Data_Preparation_Functions(unittest.TestCase):

    def test_backsummation(self):
        N_SUMMANDS = 3
        temp_vars = []

        df = create_sample_dataframe_2()
        for i in range(0,N_SUMMANDS):
            temp_var = "M_{0}".format(i)
            df[temp_var] = df["Measurement"].shift(i)
            temp_vars.append(temp_var)

        df["Sum"]  = df[temp_vars].sum(axis=1)
        df["Mean"] = df[temp_vars].mean(axis=1)
        df.loc[(df["Trigger"]==False), "Sum"] = np.nan
        df.loc[(df["Trigger"]==False), "Mean"] = np.nan

        assert_series_equal(expected_result()["Sum"],df["Sum"])
        assert_series_equal(expected_result()["Mean"],df["Mean"])

    def test_rolling(self):
        df = create_sample_dataframe_2()
        df["Sum"]  = df[(df["Valid"] == True)]["Measurement"].rolling(window=3).sum()
        df["Mean"] = df[(df["Valid"] == True)]["Measurement"].rolling(window=3).mean()

        df.loc[(df["Trigger"]==False), "Sum"] = np.nan
        df.loc[(df["Trigger"]==False), "Mean"] = np.nan
        assert_series_equal(expected_result()["Sum"],df["Sum"])
        assert_series_equal(expected_result()["Mean"],df["Mean"])


if __name__ == '__main__':
    suite = unittest.TestLoader().loadTestsFromTestCase(Data_Preparation_Functions)
    unittest.TextTestRunner(verbosity=2).run(suite)

Any help or solution is greatly appreciated. Thanks and cheers!

EDIT: Clarification: This is the resulting dataframe I expect:

      Measurement  Trigger  Valid   Sum   Mean
   0          2.0    False   True   NaN    NaN
   1          4.0    False   True   NaN    NaN
   2          3.0    False   True   NaN    NaN
   3          0.0     True  False   9.0    3.0
   4        100.0    False   True   NaN    NaN
   5          3.0    False   True   NaN    NaN
   6          2.0    False   True   NaN    NaN
   7          1.0     True   True   6.0    2.0

EDIT2: Another clarification:

I did not miscalculate; rather, I did not make my intentions as clear as I could have. Here's another try using the same dataframe:

[Image: the desired dataframe, with the relevant fields highlighted]

Let's first look at the Trigger column: we find the first True at index 3 (green rectangle). So index 3 is the point where we start looking. There is no valid measurement at index 3 (column Valid is False; red rectangle). So we go further back in time until we have accumulated three rows where Valid is True. This happens for indices 2, 1 and 0. For these three indices, we calculate the sum and mean of the column Measurement (blue rectangle):

  • SUM: 2.0 + 4.0 + 3.0 = 9.0
  • MEAN: (2.0 + 4.0 + 3.0) / 3 = 3.0

Now we start the next iteration of this little algorithm: look for the next True in the Trigger column. We find it at index 7 (green rectangle). There is also a valid measurement at index 7, so we include it this time. For our calculation, we use indices 7, 6 and 5 (green rectangle), and thus get:

  • SUM: 1.0 + 2.0 + 3.0 = 6.0
  • MEAN: (1.0 + 2.0 + 3.0) / 3 = 2.0

I hope this sheds more light on this little problem.
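The procedure described above can be written down as a plain loop, which may serve as a reference to check faster solutions against. This is only a minimal sketch: the function name backward_valid_stats and its parameter n are made up for illustration, and it assumes that a trigger with fewer than n valid rows behind it should stay NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Measurement": [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0],
     "Valid":       [True, True, True, False, True, True, True, True],
     "Trigger":     [False, False, False, True, False, False, False, True]})

def backward_valid_stats(df, n=3):
    """Hypothetical helper: for every Trigger row, sum and average the
    last n valid measurements, counting back from (and including) it."""
    out = pd.DataFrame({"Sum": np.nan, "Mean": np.nan}, index=df.index)
    # Positions of all rows where Trigger is True
    for pos in np.flatnonzero(df["Trigger"].to_numpy()):
        # All rows up to and including the trigger row
        window = df.iloc[: pos + 1]
        # Keep only valid measurements, then take the n most recent ones
        valid = window.loc[window["Valid"], "Measurement"].tail(n)
        # Assumption: leave NaN if fewer than n valid rows are available
        if len(valid) == n:
            out.iloc[pos] = [valid.sum(), valid.mean()]
    return out

result = backward_valid_stats(df, n=3)
print(df.join(result))
```

The loop re-slices the frame for every trigger, so it is slow on large data, but it follows the description step by step.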

Here's an option: take the 3-period rolling mean and sum.

df['RollM'] = df.Measurement.rolling(window=3,min_periods=0).mean()

df['RollS'] = df.Measurement.rolling(window=3,min_periods=0).sum()

Now set the rows where Trigger is False to NaN:

df.loc[df.Trigger == False,['RollS','RollM']] = np.nan

This yields:

   Measurement  Trigger  Valid     RollM  RollS
0          2.0    False   True       NaN    NaN
1          4.0    False   True       NaN    NaN
2          3.0    False   True       NaN    NaN
3          0.0     True  False  2.333333    7.0
4        100.0    False   True       NaN    NaN
5          3.0    False   True       NaN    NaN
6          2.0    False   True       NaN    NaN
7          1.0     True   True  2.000000    6.0

Edit: updated to take the Valid column into account.

df['mean'],df['sum'] = np.nan,np.nan

roller = df.Measurement.rolling(window=3,min_periods=0).agg(['mean','sum'])

df.loc[(df.Trigger == True) & (df.Valid == True),['mean','sum']] = roller

df.loc[(df.Trigger == True) & (df.Valid == False),['mean','sum']] = roller.shift(1)

This yields:

  Measurement  Trigger  Valid  mean  sum
0          2.0    False   True   NaN  NaN
1          4.0    False   True   NaN  NaN
2          3.0    False   True   NaN  NaN
3          0.0     True  False   3.0  9.0
4        100.0    False   True   NaN  NaN
5          3.0    False   True   NaN  NaN
6          2.0    False   True   NaN  NaN
7          1.0     True   True   2.0  6.0
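Note that shift(1) reaches back exactly one row, so the version above relies on the three rows directly before an invalid trigger row all being valid. A possible variant that does not depend on this is to roll over the valid rows only and then forward-fill the result onto the invalid rows. This is a sketch under assumptions: the names N, valid_meas, stats and result are illustrative, and Measurement is assumed to contain no NaN in valid rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Measurement": [2.0, 4.0, 3.0, 0.0, 100.0, 3.0, 2.0, 1.0],
     "Valid":       [True, True, True, False, True, True, True, True],
     "Trigger":     [False, False, False, True, False, False, False, True]})

N = 3  # number of valid measurements to aggregate

# Roll over the valid measurements only, so invalid rows never enter a window
valid_meas = df.loc[df["Valid"], "Measurement"]
roll = valid_meas.rolling(window=N)
stats = pd.DataFrame({"Sum": roll.sum(), "Mean": roll.mean()})

# Bring the result back to the full index; an invalid row inherits the
# statistics of the most recent valid row before it
stats = stats.reindex(df.index).ffill()

# Keep values only where Trigger is True
stats.loc[~df["Trigger"], :] = np.nan

result = df.join(stats)
print(result)
```

On the sample data this gives Sum = 9.0, Mean = 3.0 at index 3 and Sum = 6.0, Mean = 2.0 at index 7, with NaN elsewhere.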

