
Pandas taking Cumulative Sum with Reset

Problem

I'm trying to keep a running count of consecutive timestamps (minute frequency). I currently have a way of taking a cumulative sum and resetting it whenever two columns do not match, but it's done with a for loop. Is there a way to do this without the loop?

Code

# Shift every index timestamp forward by one minute ('min' replaces the deprecated 'T' alias)
cb_arbitrage['shift'] = cb_arbitrage.index.shift(1, freq='min')

Returns:

                        cccccccc     bbbbbbbb  cb_spread         shift
timestamp                                                                   
2017-07-07 18:23:00  2535.002000  2524.678462  10.323538 2017-07-07 18:24:00
2017-07-07 18:24:00  2535.007826  2523.297619  11.710207 2017-07-07 18:25:00
2017-07-07 18:25:00  2535.004167  2524.391000  10.613167 2017-07-07 18:26:00
2017-07-07 18:26:00  2534.300000  2521.838667  12.461333 2017-07-07 18:27:00
2017-07-07 18:27:00  2530.231429  2520.195625  10.035804 2017-07-07 18:28:00
2017-07-07 18:28:00  2529.444667  2518.782143  10.662524 2017-07-07 18:29:00
2017-07-07 18:29:00  2528.988000  2518.802963  10.185037 2017-07-07 18:30:00
2017-07-07 18:59:00  2514.403367  2526.473333  12.069966 2017-07-07 19:00:00
2017-07-07 19:01:00  2516.410000  2528.980000  12.570000 2017-07-07 19:02:00

Then I do the following:

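# Move the shifted timestamps down one row, so each row holds "previous timestamp + 1 minute"
cb_arbitrage['shift'] = cb_arbitrage['shift'].shift(1)
# The first row has no predecessor; fill it with its own timestamp (avoids chained assignment)
cb_arbitrage.loc[cb_arbitrage.index[0], 'shift'] = cb_arbitrage.index[0]
cb_arbitrage['count'] = 0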

Which returns:

                        cccccccc     bbbbbbbb  cb_spread               shift  count
timestamp                                                                          
2017-07-07 18:23:00  2535.002000  2524.678462  10.323538 2017-07-07 18:23:00      0
2017-07-07 18:24:00  2535.007826  2523.297619  11.710207 2017-07-07 18:24:00      0
2017-07-07 18:25:00  2535.004167  2524.391000  10.613167 2017-07-07 18:25:00      0
2017-07-07 18:26:00  2534.300000  2521.838667  12.461333 2017-07-07 18:26:00      0
2017-07-07 18:27:00  2530.231429  2520.195625  10.035804 2017-07-07 18:27:00      0
2017-07-07 18:28:00  2529.444667  2518.782143  10.662524 2017-07-07 18:28:00      0
2017-07-07 18:29:00  2528.988000  2518.802963  10.185037 2017-07-07 18:29:00      0
2017-07-07 18:59:00  2514.403367  2526.473333  12.069966 2017-07-07 18:30:00      0
2017-07-07 19:01:00  2516.410000  2528.980000  12.570000 2017-07-07 19:00:00      0
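The two steps above can also be collapsed into one. The following is a minimal sketch on a small made-up frame (the four timestamps and the `cb_spread` values are illustrative assumptions, not the asker's data); it builds the same "previous timestamp plus one minute" column directly from the index.

```python
import pandas as pd

# Toy minute-level index with one gap (illustrative values only).
idx = pd.DatetimeIndex(
    ["2017-07-07 18:23", "2017-07-07 18:24", "2017-07-07 18:25", "2017-07-07 18:59"],
    name="timestamp",
)
df = pd.DataFrame({"cb_spread": [10.3, 11.7, 10.6, 12.1]}, index=idx)

# Previous row's timestamp plus one minute; the first row has no predecessor (NaT).
expected = df.index.to_series().shift(1) + pd.Timedelta(minutes=1)

# Fill the first row with its own timestamp, mirroring the original code.
df["shift"] = expected.fillna(df.index.to_series())
```

A row is then part of a consecutive run exactly when its index equals its `shift` value, which here is false only for the 18:59 row.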

Then, the loop to calculate the cumulative sum, with reset:

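count = 0
for i, row in cb_arbitrage.iterrows():
    # Consecutive if this timestamp equals the previous timestamp plus one minute
    if i == row['shift']:
        count += 1
    else:
        count = 1  # gap found: restart the streak
    # set_value was removed in pandas 1.0; .at is the scalar setter
    cb_arbitrage.at[i, 'count'] = count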

Which gives me my expected result:

                        cccccccc     bbbbbbbb  cb_spread               shift  count
timestamp                                                                          
2017-07-07 18:23:00  2535.002000  2524.678462  10.323538 2017-07-07 18:23:00      1
2017-07-07 18:24:00  2535.007826  2523.297619  11.710207 2017-07-07 18:24:00      2
2017-07-07 18:25:00  2535.004167  2524.391000  10.613167 2017-07-07 18:25:00      3
2017-07-07 18:26:00  2534.300000  2521.838667  12.461333 2017-07-07 18:26:00      4
2017-07-07 18:27:00  2530.231429  2520.195625  10.035804 2017-07-07 18:27:00      5
2017-07-07 18:28:00  2529.444667  2518.782143  10.662524 2017-07-07 18:28:00      6
2017-07-07 18:29:00  2528.988000  2518.802963  10.185037 2017-07-07 18:29:00      7
2017-07-07 18:59:00  2514.403367  2526.473333  12.069966 2017-07-07 18:30:00      1
2017-07-07 19:01:00  2516.410000  2528.980000  12.570000 2017-07-07 19:00:00      1
2017-07-07 21:55:00  2499.904560  2510.814000  10.909440 2017-07-07 19:02:00      1
2017-07-07 21:56:00  2500.134615  2510.812857  10.678242 2017-07-07 21:56:00      2

You can use the diff method, which returns the difference between each row and the previous row, and then check whether that difference equals one minute. From there, it takes a bit of trickery to reset the streaks within the data.
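As a small illustration of that first step, here is a sketch on a handful of made-up timestamps (the sample values and the names `ts`, `gap`, and `is_consecutive` are assumptions for the example, not part of the original post).

```python
import pandas as pd

# Illustrative timestamps (a hypothetical sample, not the asker's full data).
ts = pd.Series(pd.to_datetime([
    "2017-07-07 18:23", "2017-07-07 18:24", "2017-07-07 18:25",
    "2017-07-07 18:59", "2017-07-07 19:01", "2017-07-07 19:02",
]))

# Gap to the previous row; the first row has no predecessor, so it is NaT.
gap = ts.diff()

# True where the row follows its predecessor by exactly one minute.
# Comparing NaT with a Timedelta gives False, so the first row starts a new streak.
is_consecutive = gap == pd.Timedelta(minutes=1)
print(is_consecutive.tolist())  # [False, True, True, False, False, True]
```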

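We first take the cumulative sum of the boolean Series, which gets us close to what we want. To reset it, we multiply the cumulative sum by the original boolean Series; since False evaluates as 0, every gap row drops to 0. The negative jump at each such drop is the running total accumulated so far, so forward-filling that jump and adding it back to the cumulative sum makes the count restart after every gap.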

# Assumes timestamp is a regular column, e.g. after cb_arbitrage = cb_arbitrage.reset_index()
s = cb_arbitrage.timestamp.diff() == pd.Timedelta('1 minute')
s1 = s.cumsum()
s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1, fill_value=0) + 1

0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     1.0
8     1.0
9     1.0
10    2.0
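A different technique that is often used for this kind of reset is to label each streak and count within it using groupby and cumcount. The sketch below is not the method from the answer above; the helper name `streak_count` and its `step` parameter are hypothetical, and the timestamps are copied from the expected-result table in the question. It also cross-checks the result against the answer's expression.

```python
import pandas as pd

def streak_count(timestamps: pd.Series, step: str = "1 minute") -> pd.Series:
    """Running count of consecutive timestamps, restarting at 1 after each gap."""
    consecutive = timestamps.diff() == pd.Timedelta(step)
    # Every non-consecutive row opens a new streak; cumsum turns that into a streak id.
    streak_id = (~consecutive).cumsum()
    # cumcount numbers the rows within each streak from 0, so add 1.
    return timestamps.groupby(streak_id).cumcount() + 1

# Timestamps taken from the expected-result table in the question.
ts = pd.Series(pd.to_datetime([
    "2017-07-07 18:23", "2017-07-07 18:24", "2017-07-07 18:25", "2017-07-07 18:26",
    "2017-07-07 18:27", "2017-07-07 18:28", "2017-07-07 18:29", "2017-07-07 18:59",
    "2017-07-07 19:01", "2017-07-07 21:55", "2017-07-07 21:56",
]))
counts = streak_count(ts)
print(counts.tolist())  # [1, 2, 3, 4, 5, 6, 7, 1, 1, 1, 2]

# Cross-check against the expression from the answer above.
s = ts.diff() == pd.Timedelta("1 minute")
s1 = s.cumsum()
answer = s.mul(s1).diff().where(lambda x: x < 0).ffill().add(s1, fill_value=0) + 1
assert counts.tolist() == answer.astype(int).tolist()
```

One practical difference is that this version returns integers rather than floats, since it never passes through NaN values.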


 