
python aggregation of two time-series

I have two pandas time-series dataframes, and I want to aggregate the values of one time series based on the intervals of the other. Let me show by example. The first time series is as follows:

        date    value
0 2016-03-21       10
1 2016-03-25       10
2 2016-04-10       10
3 2016-05-05       10

The second one is a date range with 10-calendar-day intervals, extracted from the above series. I have already written the code to extract this from the above data.

     date
 0   2016-03-21
 1   2016-03-31
 2   2016-04-10
 3   2016-04-20
 4   2016-04-30
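The question does not show how the second frame was built, so the following is only a sketch of one way to reproduce both frames, assuming the grid starts at the earliest date and steps by 10 days up to the latest date. The names df1 and df2 are taken from the answers further down.

```python
import pandas as pd

# The first series from the question.
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-21", "2016-03-25", "2016-04-10", "2016-05-05"]),
    "value": [10, 10, 10, 10],
})

# Assumed construction of the 10-day grid: start at the earliest date and
# step by 10 calendar days without going past the latest date.
df2 = pd.DataFrame({
    "date": pd.date_range(df1["date"].min(), df1["date"].max(), freq="10D"),
})
print(df2)
```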

I want to write some code to get this resultant dataframe:

     date        value
 0   2016-03-21  20
 1   2016-03-31   0
 2   2016-04-10  10
 3   2016-04-20   0
 4   2016-04-30  10

Could you please suggest a way to do this in python, preferably without using loops?

You can bin the data in df1 based on bins built from the dates in df2:

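# Bin edges: one edge per date in df2, plus a closing edge 10 days past the last date.
bins = pd.date_range(df2.date.min(), df2.date.max() + pd.DateOffset(10), freq='10D')
# Label each left-closed bin with its start date.
labels = df2.date
# observed=False keeps the empty bins in the result, so they sum to 0.
df1.groupby(pd.cut(df1.date, bins=bins, right=False, labels=labels), observed=False).value.sum().reset_index()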


    date        value
0   2016-03-21  20
1   2016-03-31  0
2   2016-04-10  10
3   2016-04-20  0
4   2016-04-30  10
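To see what the binning step does before the groupby, here is a small self-contained sketch that rebuilds the two frames (under the assumption that df2 is a plain 10-day date_range over df1) and prints the bin each row of df1 falls into.

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-21", "2016-03-25", "2016-04-10", "2016-05-05"]),
    "value": [10, 10, 10, 10],
})
df2 = pd.DataFrame({"date": pd.date_range("2016-03-21", "2016-04-30", freq="10D")})

# Six edges define five left-closed intervals, one per date in df2.
bins = pd.date_range(df2["date"].min(), df2["date"].max() + pd.DateOffset(10), freq="10D")

# Each row of df1 is labelled with the start date of the interval it falls into.
assigned = pd.cut(df1["date"], bins=bins, right=False, labels=df2["date"])
print(assigned.cat.codes.tolist())  # [0, 0, 2, 4]
```

The codes are positions in df2, so the first two rows share the first bin, and bins 1 and 3 receive nothing.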

Numpy searchsorted

This is the first thing I thought of, but it wasn't trivial to iron out. @Vaishali's answer is very similar in spirit to this, and simpler. But I'm like a dog with a bone, and I can't let it go until I figure it out.

To explain a little bit: searchsorted takes the values of one array, in this case the dates in df, and finds where each would have to be inserted into another sorted array, here the equally spaced dates, in order to maintain sortedness. This sounds complicated, but if we visualize it, we can see what is going on. I'll use letters to demonstrate, chosen to correspond with the dates.

x = np.array([*'abdg'])
y = np.array([*'acdef'])

Notice that for each letter in x, I found where its backstop was in y, that is, the position of the last letter in y that is not greater than it.

#  i -> 0 0   2     4
#  x -> a b   d     g
#  y -> a   c d e f
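The index row in the diagram can be reproduced directly. This is a minimal sketch using the same letters, with side='right' and the subtraction of 1 mirroring the solution further down.

```python
import numpy as np

x = np.array(list("abdg"))   # stands in for the dates in df
y = np.array(list("acdef"))  # stands in for the evenly spaced grid; must be sorted

# side='right' returns the insertion point after any equal element, so
# subtracting 1 gives the index of the last y entry that is <= each x.
i = y.searchsorted(x, side="right") - 1
print(i)  # [0 0 2 4]
```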

This works out to what I do below.

Setup

df = pd.DataFrame(dict(
    date=pd.to_datetime(['2016-03-21', '2016-03-25', '2016-04-10', '2016-05-05']),
    value=[10, 10, 10, 10]
))

dates = pd.date_range(df.date.min(), df.date.max(), freq='10D')

Solution

d = df.date.values
v = df.value.values

i = dates.searchsorted(d, side='right') - 1
a = np.zeros(len(dates), dtype=v.dtype)

np.add.at(a, i, v)

pd.DataFrame(dict(
    date=dates, value=a
))

        date  value
0 2016-03-21     20
1 2016-03-31      0
2 2016-04-10     10
3 2016-04-20      0
4 2016-04-30     10

You'll notice I used np.add.at in order to sum v at just the right spots. I could also have done this with np.bincount. I like the approach above better because np.bincount casts to float even though v is of type int.

d = df.date.values
v = df.value.values

i = dates.searchsorted(d, side='right') - 1

pd.DataFrame(dict(
    date=dates, value=np.bincount(i, v, minlength=len(dates)).astype(v.dtype)
))

        date  value
0 2016-03-21     20
1 2016-03-31      0
2 2016-04-10     10
3 2016-04-20      0
4 2016-04-30     10
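As a small illustration of the two points above, the sketch below uses the index array from this example to show why a plain indexed += is not enough when an index repeats, and that np.bincount with weights returns floats. The array length 5 is simply the number of dates here.

```python
import numpy as np

i = np.array([0, 0, 2, 4])      # bin index per row; index 0 appears twice
v = np.array([10, 10, 10, 10])  # integer values

# Fancy-indexed += is buffered: the repeated index 0 is only written once.
buffered = np.zeros(5, dtype=v.dtype)
buffered[i] += v
print(buffered)  # [10  0 10  0 10]

# np.add.at is unbuffered, so both rows that land in bin 0 are accumulated.
unbuffered = np.zeros(5, dtype=v.dtype)
np.add.at(unbuffered, i, v)
print(unbuffered)  # [20  0 10  0 10]

# With weights, np.bincount returns float64, hence the astype back to int.
counts = np.bincount(i, weights=v, minlength=5)
print(counts.dtype)  # float64
```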

I just had time to add my solution, using numpy broadcasting:

s1=df1.date.values
s2=df2.date.values
a=(np.abs(s1-s2[:,None])/np.timedelta64(60*60*24, 's')<10).dot(df1.value.values)
a
Out[183]: array([20, 10, 10,  0, 10], dtype=int64)

#df2['value']=a
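Note that the output above has 10 in the second row, whereas the result requested in the question has 0 there: the absolute difference matches dates on both sides of each grid date, so 2016-03-25 is counted for both 2016-03-21 and 2016-03-31. A one-sided window seems closer to what was asked. The following is only a sketch under that assumption, rebuilding df1 and df2 so it runs on its own.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-21", "2016-03-25", "2016-04-10", "2016-05-05"]),
    "value": [10, 10, 10, 10],
})
df2 = pd.DataFrame({"date": pd.date_range("2016-03-21", "2016-04-30", freq="10D")})

s1 = df1["date"].to_numpy()
s2 = df2["date"].to_numpy()

# delta[r, c] is how far observation c lies after grid date r.
delta = s1 - s2[:, None]

# Keep only observations in the half-open window [grid date, grid date + 10 days).
mask = (delta >= np.timedelta64(0, "D")) & (delta < np.timedelta64(10, "D"))

a = mask.dot(df1["value"].to_numpy())
print(a)  # [20  0 10  0 10]
```

Like the original broadcast, this builds a len(df2) x len(df1) matrix, so it is best suited to small inputs.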
