
Python aggregation of two time series

I have two pandas time-series DataFrames, and I want to aggregate the values of one based on the intervals defined by the other. Let me show by example. The first time series is as follows:

        date    value
0 2016-03-21       10
1 2016-03-25       10
2 2016-04-10       10
3 2016-05-05       10

The second one is a date range with 10-calendar-day intervals derived from the series above. I have already written the code that produces it:

     date
 0   2016-03-21
 1   2016-03-31
 2   2016-04-10
 3   2016-04-20
 4   2016-04-30

I want to write some code to get this resultant dataframe:

     date        value
 0   2016-03-21  20
 1   2016-03-31   0
 2   2016-04-10  10
 3   2016-04-20   0
 4   2016-04-30  10

Could you please suggest a way to do this (preferably without loops) in Python?

You can bin the data in df1 using the dates in df2 as bin edges:

bins = pd.date_range(df2.date.min(), df2.date.max() + pd.DateOffset(10), freq='10D')
labels = df2.date
df1.groupby(pd.cut(df1.date, bins=bins, right=False, labels=labels)).value.sum().reset_index()


    date        value
0   2016-03-21  20
1   2016-03-31  0
2   2016-04-10  10
3   2016-04-20  0
4   2016-04-30  10
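For completeness, a self-contained sketch of this approach, reconstructing df1 and df2 from the question (passing observed=False keeps the empty bins in newer pandas versions):

```python
import pandas as pd

# Reconstruct the example frames from the question
df1 = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-21', '2016-03-25',
                            '2016-04-10', '2016-05-05']),
    'value': [10, 10, 10, 10],
})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['2016-03-21', '2016-03-31', '2016-04-10',
                            '2016-04-20', '2016-04-30']),
})

# One extra edge past the last interval start so the final
# 10-day window is included; each df2 date labels one bin.
bins = pd.date_range(df2.date.min(), df2.date.max() + pd.DateOffset(10),
                     freq='10D')
out = (df1.groupby(pd.cut(df1.date, bins=bins, right=False,
                          labels=df2.date), observed=False)
          .value.sum().reset_index())
print(out)
```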

Numpy searchsorted

This is the first thing I thought of, but it wasn't trivial to iron out. @Vaishali's answer is in spirit very similar to this and simpler. But I'm like a dog with a bone and I can't let it go until I figure it out.

To explain a little: searchsorted goes through an array (in this case the equally spaced dates) and finds, for each element, where it would be placed in another array in order to maintain sorted order. This sounds complicated, but if we visualize it, we can see what is going on. I'll demonstrate with letters chosen to correspond to the dates.

x = np.array([*'abdg'])
y = np.array([*'acdef'])

Notice that for each letter in x, I find the position of its "backstop" in y:

#  i -> 0 0   2     4
#  x -> a b   d     g
#  y -> a   c d e f
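This lookup can be run directly; a minimal sketch of the letter example:

```python
import numpy as np

x = np.array(list('abdg'))
y = np.array(list('acdef'))

# side='right' gives the insertion point after any equal element,
# so subtracting 1 lands on the last y element <= each x (the "backstop")
i = np.searchsorted(y, x, side='right') - 1
print(i)  # [0 0 2 4]
```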

This works out to what I do below.

Setup

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    date=pd.to_datetime(['2016-03-21', '2016-03-25', '2016-04-10', '2016-05-05']),
    value=[10, 10, 10, 10]
))

dates = pd.date_range(df.date.min(), df.date.max(), freq='10D')

Solution

d = df.date.values
v = df.value.values

# index of the interval each date falls into: the position of the
# last bin start that is <= the date
i = dates.searchsorted(d, side='right') - 1

# accumulate the values into one slot per interval
a = np.zeros(len(dates), dtype=v.dtype)
np.add.at(a, i, v)

pd.DataFrame(dict(
    date=dates, value=a
))

        date  value
0 2016-03-21     20
1 2016-03-31      0
2 2016-04-10     10
3 2016-04-20      0
4 2016-04-30     10

You'll notice I used np.add.at in order to sum v at just the right spots. I could have also done this with np.bincount. I like the approach above better because np.bincount casts to float even when v is of integer type.

d = df.date.values
v = df.value.values

i = dates.searchsorted(d, side='right') - 1

pd.DataFrame(dict(
    date=dates, value=np.bincount(i, v).astype(v.dtype)
))

        date  value
0 2016-03-21     20
1 2016-03-31      0
2 2016-04-10     10
3 2016-04-20      0
4 2016-04-30     10
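To see the dtype point concretely, a small sketch with made-up i and v:

```python
import numpy as np

i = np.array([0, 0, 2, 4])        # bin index for each value
v = np.array([10, 10, 10, 10])    # integer values

counts = np.bincount(i, v)        # weighted counts per bin
print(counts.dtype)               # float64, even though v is integer
print(counts.astype(v.dtype))     # cast back to int: [20  0 10  0 10]
```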

I just have time to add my solution, using numpy broadcasting:

s1 = df1.date.values
s2 = df2.date.values

# days from each interval start (rows) to each df1 date (columns);
# a date belongs to an interval when it falls in [start, start + 10 days)
days = (s1 - s2[:, None]) / np.timedelta64(1, 'D')
a = ((days >= 0) & (days < 10)).dot(df1.value.values)
a
Out[183]: array([20,  0, 10,  0, 10], dtype=int64)

# df2['value'] = a
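For completeness, a self-contained sketch of the broadcasting approach, reconstructing df1 and df2 from the question; the membership test is one-sided ([start, start + 10 days)) so each date is counted only in the interval it falls within, which matches the requested output:

```python
import numpy as np
import pandas as pd

s1 = pd.to_datetime(['2016-03-21', '2016-03-25',
                     '2016-04-10', '2016-05-05']).values
s2 = pd.to_datetime(['2016-03-21', '2016-03-31', '2016-04-10',
                     '2016-04-20', '2016-04-30']).values
values = np.array([10, 10, 10, 10])

# days elapsed from each interval start (rows) to each event (cols)
days = (s1 - s2[:, None]) / np.timedelta64(1, 'D')

# an event belongs to an interval if it falls within [start, start + 10)
mask = (days >= 0) & (days < 10)
print(mask.dot(values))  # [20  0 10  0 10]
```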
