
Calculate the sum of the differences between all dates within an expanding window of dates

The output column below is what I'm trying to calculate, and the diffs column is an explanation of the differences that are summed to calculate output.

+------------+--------+-------------+
|       date | output |    diffs    |
+------------+--------+-------------+
| 01/01/2000 |        |             |
| 10/01/2000 |      9 | [9]         |
| 20/01/2000 |     29 | [10, 19]    |
| 25/01/2000 |     44 | [5, 15, 24] |
+------------+--------+-------------+

I've thought about using rolling and then creating a new column within each window for the diffs based on the last record in the current window and then summing these. However, rolling doesn't seem to have the ability to fix at the beginning of a DataFrame. I suppose I could calculate the difference between the minimum and maximum dates and use this as the rolling period but that seems hacky.

I've also looked at expanding but I couldn't see a way of creating new diffs as the window expanded.

Is there a non-loop, hopefully vectorisable, solution to this?

Here's the DataFrame:

import datetime as dt

import numpy as np
import pandas as pd


df = pd.DataFrame(
    {
        'date': (
            dt.datetime(2000, 1, 1), dt.datetime(2000, 1, 10),
            dt.datetime(2000, 1, 20), dt.datetime(2000, 1, 25),
        ),
        'output': (np.nan, 9, 29, 44),
    }
)

If you're looking for output, try:

datediff = df.date.diff()/pd.Timedelta('1D')

df['output'] = (datediff * np.arange(len(df))).cumsum()

Output:

        date  output
0 2000-01-01     NaN
1 2000-01-10     9.0
2 2000-01-20    29.0
3 2000-01-25    44.0

I'll leave it to you to work out the logic behind this.
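One way to see the logic, as a sketch rather than the answerer's own explanation: when the k-th date (counting from 0) is added, each of the k earlier differences grows by the latest gap, so the running total grows by k times that gap. The helper names `brute_force` and `closed_form` below are illustrative only; the check compares the cumulative-sum formula against a direct summation on the question's dates.

```python
import numpy as np
import pandas as pd

dates = pd.Series(
    pd.to_datetime(["2000-01-01", "2000-01-10", "2000-01-20", "2000-01-25"])
)


def brute_force(dates):
    # For each row, sum the day gaps between that date and every earlier date.
    days = (dates - dates.iloc[0]).dt.days.to_numpy()
    return np.array([(days[k] - days[:k]).sum() for k in range(len(days))])


def closed_form(dates):
    # Adding the k-th date lengthens each of the k earlier gaps by the
    # latest step, so the running total grows by k * step.
    step = dates.diff() / pd.Timedelta("1D")
    return (step * np.arange(len(dates))).cumsum()


bf = brute_force(dates)   # first entry is 0 (no earlier dates)
cf = closed_form(dates)   # first entry is NaN (diff of the first row)
```

The two agree on every row except the first, where the direct sum gives 0 and the cumulative-sum version gives NaN.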

We may still need a for loop to build the diffs column, but NumPy broadcasting can reduce the calculation time:

s = df.date.values
df['new']  = [y[:x][::-1] for x,y in enumerate((s[:,None]-s).astype('timedelta64[D]'))]
df
        date  output                         new
0 2000-01-01     NaN                          []
1 2000-01-10     9.0                    [9 days]
2 2000-01-20    29.0          [10 days, 19 days]
3 2000-01-25    44.0  [5 days, 15 days, 24 days]

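For your output, sum the lower triangle of the same difference matrix (df.date.diff().dt.days.cumsum() on its own only gives the days elapsed since the first date: NaN, 9, 19, 24):

np.tril((s[:, None] - s).astype('timedelta64[D]').astype(int)).sum(axis=1)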

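Using NumPy broadcasting without looping (note that this uses the day of the month, so it only works when all dates fall within the same month):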

i = df.date.dt.day.values
j = np.arange(len(df))

df['output'] = np.triu(np.where((j < j[:, None]), i, (i - i[:, None]))).sum(axis = 0)
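A possible variant of the same idea, sketched under the assumption that the dates are sorted in ascending order: replacing the day of the month with the number of days since the first date keeps the broadcasting approach valid across month boundaries. The sample dates and the names `days`, `pairwise` and `totals` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative dates that cross two month boundaries (2000 is a leap year).
dates = pd.Series(pd.to_datetime(["2000-01-25", "2000-02-04", "2000-03-01"]))

# Whole days since the first date; unlike dt.day, this keeps increasing
# when the month changes.
days = (dates - dates.iloc[0]).dt.days.to_numpy()

# pairwise[r, c] = days[c] - days[r]; the upper triangle (r <= c) holds the
# gap from each earlier date r to date c.
pairwise = days - days[:, None]

# Summing each column of the upper triangle adds up the gaps to all earlier dates.
totals = np.triu(pairwise).sum(axis=0)
```

Here the first row comes out as 0 rather than NaN, so it would need to be masked if a missing value is preferred there.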
