Python aggregation of two time-series
I have two pandas time-series dataframes, and I want to aggregate the values of one time series over the intervals defined by the other. Let me show by example. The first time series is as follows:
date value
0 2016-03-21 10
1 2016-03-25 10
2 2016-04-10 10
3 2016-05-05 10
The second one is a date range with 10-calendar-day intervals, extracted from the series above. I have already written the code to extract it from the data above.
date
0 2016-03-21
1 2016-03-31
2 2016-04-10
3 2016-04-20
4 2016-04-30
I want to write some code to get this resulting dataframe:
date value
0 2016-03-21 20
1 2016-03-31 0
2 2016-04-10 10
3 2016-04-20 0
4 2016-04-30 10
Could you please suggest a way to do this in Python, preferably without using loops?
You can bin the data in df1 based on bins built from the dates in df2:
import pandas as pd
bins = pd.date_range(df2.date.min(), df2.date.max() + pd.DateOffset(10), freq='10D')
labels = df2.date
# observed=False keeps the empty bins so they show up with a sum of 0
df1.groupby(pd.cut(df1.date, bins=bins, right=False, labels=labels), observed=False).value.sum().reset_index()
date value
0 2016-03-21 20
1 2016-03-31 0
2 2016-04-10 10
3 2016-04-20 0
4 2016-04-30 10
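For anyone who wants to run the binning approach end to end, here is a minimal self-contained sketch. It rebuilds df1 and df2 from the question's tables. Passing `observed=False` explicitly is my own addition: it is an assumption about which pandas version is in use, since recent releases warn about or change the default handling of empty categories.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-21", "2016-03-25", "2016-04-10", "2016-05-05"]),
    "value": [10, 10, 10, 10],
})
df2 = pd.DataFrame({
    "date": pd.date_range("2016-03-21", "2016-04-30", freq="10D"),
})

# Bin edges: every interval start, plus one extra edge closing the last bin
bins = pd.date_range(df2.date.min(), df2.date.max() + pd.Timedelta(days=10), freq="10D")

# Left-closed bins [start, start + 10 days), labelled by their start date
binned = pd.cut(df1.date, bins=bins, right=False, labels=list(df2.date))

# observed=False keeps empty bins, so they appear with a sum of 0
result = df1.groupby(binned, observed=False)["value"].sum().reset_index()
print(result)
```

The `right=False` argument is what makes a value dated exactly on an interval start (such as 2016-04-10) count toward the interval that begins on that day rather than the previous one.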
searchsorted
This is the first thing I thought of, but it wasn't trivial to iron out. @Vaishali's answer is very similar in spirit and simpler. But I'm like a dog with a bone, and I can't let it go until I figure it out.
To explain a little: searchsorted takes a sorted array, in this case the equally spaced dates, and finds where the elements of another array would have to be inserted to keep it sorted. This sounds complicated, but if we visualize it, we can see what is going on. I'll use letters to demonstrate, chosen to correspond with the dates.
import numpy as np
import pandas as pd
x = np.array([*'abdg'])
y = np.array([*'acdef'])
Notice that for each letter in x, I found where its backstop is in y, that is, the position of the last letter in y that is not greater than it.
# i -> 0 0 2 4
# x -> a b d g
# y -> a c d e f
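The index row in that diagram can be reproduced directly. This is a small sketch using the same letters; the `side='right'` followed by `- 1` combination mirrors what the date version of the code does.

```python
import numpy as np

x = np.array(list("abdg"))   # stands in for the observation dates
y = np.array(list("acdef"))  # stands in for the equally spaced interval starts

# side='right' gives the insertion point after any equal element;
# subtracting 1 turns it into the index of the last element of y that is <= x
i = y.searchsorted(x, side="right") - 1

print(i)     # [0 0 2 4]
print(y[i])  # ['a' 'a' 'd' 'f']
```

Both 'a' and 'b' land on index 0, which is why the two March values end up in the same interval.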
This works out to what I do below.
df = pd.DataFrame(dict(
date=pd.to_datetime(['2016-03-21', '2016-03-25', '2016-04-10', '2016-05-05']),
value=[10, 10, 10, 10]
))
dates = pd.date_range(df.date.min(), df.date.max(), freq='10D')
d = df.date.values
v = df.value.values
i = dates.searchsorted(d, side='right') - 1
a = np.zeros(len(dates), dtype=v.dtype)
np.add.at(a, i, v)
pd.DataFrame(dict(
date=dates, value=a
))
date value
0 2016-03-21 20
1 2016-03-31 0
2 2016-04-10 10
3 2016-04-20 0
4 2016-04-30 10
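The np.add.at call in the code above matters because the index array contains a repeated entry (two rows fall into the first interval). A plain in-place `a[i] += v` does not accumulate repeated indices. The sketch below hard-codes the index and value arrays from this example to show the difference.

```python
import numpy as np

i = np.array([0, 0, 2, 4])      # bin index of each row; index 0 appears twice
v = np.array([10, 10, 10, 10])  # values to accumulate

buffered = np.zeros(5, dtype=v.dtype)
buffered[i] += v                # the repeated index 0 is only applied once

unbuffered = np.zeros(5, dtype=v.dtype)
np.add.at(unbuffered, i, v)     # every occurrence of an index is accumulated

print(buffered)    # [10  0 10  0 10]
print(unbuffered)  # [20  0 10  0 10]
```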
You'll notice I used np.add.at in order to sum v at just the right spots. I could also have done this with np.bincount. I like the approach above better because np.bincount casts to float even though v is of type int.
d = df.date.values
v = df.value.values
i = dates.searchsorted(d, side='right') - 1
pd.DataFrame(dict(
date=dates, value=np.bincount(i, v).astype(v.dtype)
))
date value
0 2016-03-21 20
1 2016-03-31 0
2 2016-04-10 10
3 2016-04-20 0
4 2016-04-30 10
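Two details of the np.bincount variant are worth seeing in isolation. With weights, the result is float64, and its length is only `max(i) + 1`. In the example data the last row falls in the last interval, so the length happens to match. The sketch below uses a hypothetical index array where the trailing intervals are empty, and passes `minlength` to pad the result.

```python
import numpy as np

i = np.array([0, 0, 2])     # hypothetical case: nothing falls in bins 3 and 4
v = np.array([10, 10, 10])
n_bins = 5                  # number of intervals we want in the output

short = np.bincount(i, weights=v)
padded = np.bincount(i, weights=v, minlength=n_bins)

print(short.dtype)             # float64
print(len(short))              # 3
print(padded.astype(v.dtype))  # [20  0 10  0  0]
```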
I just had time to add my solution, using numpy broadcasting:
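import numpy as np
s1 = df1.date.values
s2 = df2.date.values
# Pairwise absolute gap in days between every df2 date (rows) and df1 date (columns)
a = (np.abs(s1 - s2[:, None]) / np.timedelta64(60 * 60 * 24, 's') < 10).dot(df1.value.values)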
a
Out[183]: array([20, 10, 10, 0, 10], dtype=int64)
#df2['value']=a
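Note that the output above has 10 in the second position, whereas the result requested in the question has 0 there. The absolute difference matches dates on both sides of each interval start, so 2016-03-25 is counted for both 2016-03-21 and 2016-03-31. A possible adjustment, which is my own variation and not part of the answer, keeps the broadcasting idea but only accepts dates that fall at or after the interval start and less than 10 days later:

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "date": pd.to_datetime(["2016-03-21", "2016-03-25", "2016-04-10", "2016-05-05"]),
    "value": [10, 10, 10, 10],
})
df2 = pd.DataFrame({
    "date": pd.date_range("2016-03-21", "2016-04-30", freq="10D"),
})

s1 = df1["date"].values
s2 = df2["date"].values

# Signed gap: rows are interval starts, columns are observation dates
diff = s1 - s2[:, None]

# One-sided window [start, start + 10 days)
mask = (diff >= np.timedelta64(0, "D")) & (diff < np.timedelta64(10, "D"))

a = mask.dot(df1["value"].to_numpy())
print(a)  # [20  0 10  0 10]
```

The mask has one cell per (interval, observation) pair, so memory grows with the product of the two lengths. For long series the binning or searchsorted approaches are likely to scale better.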