Get rows whose timestamps are within irregular time intervals pandas (Time Series)
Let's say I have a dataframe like this:
>>> import time
>>> import numpy as np
>>> import pandas as pd
>>> i = pd.to_datetime(np.random.randint(time.time(), time.time()+10000, 15), unit='ms').sort_values()
>>> df = pd.DataFrame({'A':range(15),'B':range(10,40,2),'C':range(10,55,3)},index = i)
>>> df
A B C
1970-01-19 05:31:36.629 0 10 10
1970-01-19 05:31:36.710 1 12 13
1970-01-19 05:31:37.779 2 14 16
1970-01-19 05:31:38.761 3 16 19
1970-01-19 05:31:39.520 4 18 22
1970-01-19 05:31:39.852 5 20 25
1970-01-19 05:31:39.994 6 22 28
1970-01-19 05:31:41.370 7 24 31
1970-01-19 05:31:41.667 8 26 34
1970-01-19 05:31:42.515 9 28 37
1970-01-19 05:31:42.941 10 30 40
1970-01-19 05:31:43.037 11 32 43
1970-01-19 05:31:43.253 12 34 46
1970-01-19 05:31:43.333 13 36 49
1970-01-19 05:31:44.135 14 38 52
What I want is:
A B C
1970-01-19 05:31:37.779 2.0 14.0 16.0 #last value within 2000 milli-sec interval from 05:31:36
1970-01-19 05:31:38.761 3.0 16.0 19.0 ##last value from the ^ value within 1000 msec interval
1970-01-19 05:31:39.994 6.0 22.0 28.0 #last value within 2000 milli-sec interval from 05:31:38
1970-01-19 05:31:39.994 6.0 22.0 28.0 *##last value from the ^ value within 1000 msec interval
1970-01-19 05:31:41.667 8.0 26.0 34.0 #last value within 2000 milli-sec interval from 05:31:40
1970-01-19 05:31:42.515 9.0 28.0 37.0 ##last value from the ^ value within 1000 msec interval
1970-01-19 05:31:43.333 13.0 36.0 49.0 #last value within 2000 milli-sec interval from 05:31:42
1970-01-19 05:31:44.135 14.0 38.0 52.0 ##last value from the ^ value within 1000 msec interval
I can achieve the rows marked with # with this code:
>>> df.resample('2000ms').ffill().dropna(axis=0)
A B C
1970-01-19 05:31:38 2.0 14.0 16.0
1970-01-19 05:31:40 6.0 22.0 28.0
1970-01-19 05:31:42 8.0 26.0 34.0
1970-01-19 05:31:44 13.0 36.0 49.0
# note I do not care about how the timestamps are getting printed, I just want the correct values.
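To make the resample step concrete, here is a minimal, self-contained sketch on a few synthetic timestamps (not the question's data): at each 2-second bin label, ffill() picks the last row at or before that label, and the first label has no earlier data, so it is NaN and gets dropped.

```python
import pandas as pd

# A tiny frame with an irregular DatetimeIndex (synthetic values).
idx = pd.to_datetime(["1970-01-19 05:31:36.629",
                      "1970-01-19 05:31:37.779",
                      "1970-01-19 05:31:39.994",
                      "1970-01-19 05:31:40.500"])
df = pd.DataFrame({"A": [0, 2, 6, 7]}, index=idx)

# resample('2000ms') puts the index into 2-second bins; ffill() fills
# each bin label with the last observation at or before that label.
# The first label (05:31:36) has no earlier data, hence NaN -> dropna.
out = df.resample("2000ms").ffill().dropna(axis=0)
print(out)
# Labels 05:31:38 and 05:31:40 carry A = 2.0 and 6.0 respectively.
```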
I can't find a solution with pandas that will give me the desired output. I could do it with two dataframes, one resampled at 2000ms and another at 1000ms, and then probably loop and insert accordingly.
The problem is, my actual data is really large: over 4000000 rows and 52 columns. If it is possible to avoid two dataframes, or loops, I would definitely want to do that.
NOTE: The row marked with * gets repeated, because there is no data within the 1000ms interval after the last value, so the last seen value is repeated. The same should happen for the 2000ms intervals as well.
If possible please show me a way... Thanks!
EDIT: Edited as per John Zwinck's comment:
import datetime

def last_time(time):
    time = str(time)
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(microseconds=1000000)
    tempdf = df.between_time(*pd.to_datetime([str(start_time), str(end_time)]).time).iloc[-1]
    return tempdf
df['timestamp'] = df.index
df2 = df.resample('2000ms').ffill().dropna(axis=0)
df3 = df2.apply(lambda x:last_time(x['timestamp']), axis = 1)
pd.concat([df2, df3]).sort_index(kind='merge')
This gives:
A B C timestamp
1970-01-19 05:31:38 2.0 14.0 16.0 1970-01-19 05:31:37.779
1970-01-19 05:31:38 3.0 16.0 19.0 1970-01-19 05:31:38.761
1970-01-19 05:31:40 6.0 22.0 28.0 1970-01-19 05:31:39.994
1970-01-19 05:31:40 6.0 22.0 28.0 1970-01-19 05:31:39.994
1970-01-19 05:31:42 8.0 26.0 34.0 1970-01-19 05:31:41.667
1970-01-19 05:31:42 9.0 28.0 37.0 1970-01-19 05:31:42.515
1970-01-19 05:31:44 13.0 36.0 49.0 1970-01-19 05:31:43.333
1970-01-19 05:31:44 14.0 38.0 52.0 1970-01-19 05:31:44.135
Which is okay, except that the apply part takes a really long time!
For easier copying:
,A,B,C
1970-01-19 05:31:36.629,0,10,10
1970-01-19 05:31:36.710,1,12,13
1970-01-19 05:31:37.779,2,14,16
1970-01-19 05:31:38.761,3,16,19
1970-01-19 05:31:39.520,4,18,22
1970-01-19 05:31:39.852,5,20,25
1970-01-19 05:31:39.994,6,22,28
1970-01-19 05:31:41.370,7,24,31
1970-01-19 05:31:41.667,8,26,34
1970-01-19 05:31:42.515,9,28,37
1970-01-19 05:31:42.941,10,30,40
1970-01-19 05:31:43.037,11,32,43
1970-01-19 05:31:43.253,12,34,46
1970-01-19 05:31:43.333,13,36,49
1970-01-19 05:31:44.135,14,38,52
The slow part of your existing code is the creation of df3, so I'll optimize that.
First, note that your last_time(x) function looks for the last record within the time range from x to x + 1 second.
Instead of using a loop, we can start by offsetting the time in the entire vector:
end_times = df2.timestamp + datetime.timedelta(microseconds=1000000)
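As a quick check of that offset step on a couple of synthetic timestamps: adding a timedelta to a datetime64 Series shifts every element at once, with no Python-level loop.

```python
import datetime
import pandas as pd

ts = pd.Series(pd.to_datetime(["1970-01-19 05:31:37.779",
                               "1970-01-19 05:31:39.994"]))
# A scalar timedelta broadcasts over the whole Series.
shifted = ts + datetime.timedelta(microseconds=1000000)
print(shifted)  # each timestamp moved forward by exactly 1 second
```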
Then we can use numpy.searchsorted() (a highly underrated function!) to search for those times in df:
idx = np.searchsorted(df.timestamp, end_times)
Incidentally, df.timestamp.searchsorted(end_times) does the same thing.
Finally, note that each of those generated indexes is one past what we want (we don't want the value 1 second after; we want the one just before that):
df3a = df.iloc[idx - 1]
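To see why subtracting one works, consider searchsorted's contract on a toy array (the numbers here are arbitrary): it returns the position where each target would be inserted to keep the array sorted.

```python
import numpy as np

# With the default side='left', searchsorted returns, for each target,
# the count of elements strictly less than that target.
times = np.array([1.0, 3.0, 5.0, 7.0])
targets = np.array([4.0, 5.0, 8.0])

pos = np.searchsorted(times, targets)
print(pos)              # [2 2 4]

# pos - 1 therefore indexes the last element strictly before each
# target (note: an exact match is excluded with side='left';
# side='right' would include it).
print(times[pos - 1])   # [3. 3. 7.]
```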
This gives the same result as your df3 except the index is not rounded down, so just replace it:
df3a.index = df2.index
This is exactly the same as your df3, but calculated much more quickly.