Get rows whose timestamps are within a specific sliding window time interval in pandas (Time Series)
I have a dataframe like this:
import time
import numpy as np
import pandas as pd
i = pd.to_datetime(np.random.randint(int(time.time()), int(time.time()) + 5000, 10), unit='ms').sort_values()
df = pd.DataFrame({'A': range(10), 'B': range(10, 30, 2), 'C': range(10, 40, 3)}, index=i)
df
A B C
1970-01-19 04:28:30.030 0 10 10
1970-01-19 04:28:30.374 1 12 13
1970-01-19 04:28:31.055 2 14 16
1970-01-19 04:28:32.026 3 16 19
1970-01-19 04:28:32.234 4 18 22
1970-01-19 04:28:32.569 5 20 25
1970-01-19 04:28:32.595 6 22 28
1970-01-19 04:28:33.520 7 24 31
1970-01-19 04:28:33.882 8 26 34
1970-01-19 04:28:34.019 9 28 37
What I want is, for each index, the last row that falls within a '1s' interval starting at that index:
df2
ix A B C
1970-01-19 04:28:30.030 1970-01-19 04:28:30.374 1 12 13
1970-01-19 04:28:30.374 1970-01-19 04:28:31.055 2 14 16
1970-01-19 04:28:31.055 1970-01-19 04:28:32.026 3 16 19
1970-01-19 04:28:32.026 1970-01-19 04:28:32.595 6 22 28
1970-01-19 04:28:32.234 1970-01-19 04:28:32.595 6 22 28
1970-01-19 04:28:32.569 1970-01-19 04:28:33.520 7 24 31
1970-01-19 04:28:32.595 1970-01-19 04:28:33.520 7 24 31
1970-01-19 04:28:33.520 1970-01-19 04:28:34.019 9 28 37
1970-01-19 04:28:33.882 1970-01-19 04:28:34.019 9 28 37
1970-01-19 04:28:34.019 nan nan nan nan
I am currently doing this with loops. At each index I use df.between_time to get all the rows in the time interval and then select the last row. As expected, it is really slow. I need something like df.shift for time. I checked out tshift and shift(periods = 1, freq = 'S'), but they do not work like shift; instead they add the specified time to each index. Can somebody help me achieve this? Thanks.
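To illustrate the behaviour described above, here is a minimal sketch on a three-row series (the series `s` and the variable names are made up for illustration). It passes a `pd.Timedelta` as `freq` instead of the `'S'` alias, which is an assumption made to stay compatible with recent pandas versions.

```python
import pandas as pd

idx = pd.to_datetime([
    "1970-01-19 04:28:30.030",
    "1970-01-19 04:28:30.374",
    "1970-01-19 04:28:31.055",
])
s = pd.Series([0, 1, 2], index=idx)

# Plain shift moves the values down by one row and leaves the index alone.
by_row = s.shift(1)

# Shift with freq keeps the values in place and moves every index label
# forward by the given offset, so no row is looked up "1s later".
by_time = s.shift(periods=1, freq=pd.Timedelta(seconds=1))

print(by_row)
print(by_time)
```

Neither variant answers "which row is the last one within 1s of this row", which is why a window lookup is needed instead.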
Note: The ix column in the desired output is optional.
PS: If a min_periods parameter (like the one in df.rolling) is possible, that would be great!
EDIT:
For a starting df:
A B C
1970-01-19 04:28:34.883 0 10 10
1970-01-19 04:28:34.900 1 12 13
1970-01-19 04:28:35.531 2 14 16
1970-01-19 04:28:36.845 3 16 19
1970-01-19 04:28:37.664 4 18 22
1970-01-19 04:28:38.332 5 20 25
1970-01-19 04:28:38.444 6 22 28
1970-01-19 04:28:38.724 7 24 31
1970-01-19 04:28:38.787 8 26 34
1970-01-19 04:28:38.951 9 28 37
import datetime

df['time'] = df.index
def last_time(time):
    time = str(time)
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(0, 1)
    return df.between_time(start_time=str(start_time)[11:-7],
                           end_time=str(end_time)[11:-7]).iloc[-1]
df.apply(lambda x: last_time(x['time']), axis=1)
# Output:
A B C time
1970-01-19 04:28:34.883 1 12 13 1970-01-19 04:28:34.900
1970-01-19 04:28:34.900 1 12 13 1970-01-19 04:28:34.900
1970-01-19 04:28:35.531 2 14 16 1970-01-19 04:28:35.531
1970-01-19 04:28:36.845 3 16 19 1970-01-19 04:28:36.845
1970-01-19 04:28:37.664 4 18 22 1970-01-19 04:28:37.664
1970-01-19 04:28:38.332 9 28 37 1970-01-19 04:28:38.951
1970-01-19 04:28:38.444 9 28 37 1970-01-19 04:28:38.951
1970-01-19 04:28:38.724 9 28 37 1970-01-19 04:28:38.951
But as you can see, I can only get second-level accuracy. That is, it considers the window from 34 to 35, so it misses 35.531, which is within the interval for both 34.883 and 34.900.
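The loss of precision comes from the string slicing in last_time. A minimal sketch of that step in isolation, using the first index value from the df above (the names `stamp`, `window` and `precise` are only for illustration):

```python
import datetime

stamp = "1970-01-19 04:28:34.883000"  # str() of the first index value
start_time = datetime.datetime.strptime(stamp[11:], "%H:%M:%S.%f")
end_time = start_time + datetime.timedelta(seconds=1)

# Dropping the last 7 characters removes ".883000", the fractional second,
# so the window collapses to whole seconds.
window = (str(start_time)[11:-7], str(end_time)[11:-7])
print(window)  # ('04:28:34', '04:28:35')

# Keeping datetime.time objects instead preserves the milliseconds.
precise = (start_time.time(), end_time.time())
print(precise)
```

Since between_time also accepts datetime.time values, passing `precise` rather than the sliced strings is one way to keep sub-second bounds.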
Assuming your times are sorted, the matching row for row 2 can never come before the matching row for row 1. E.g., if row 6 is the match for row 1, then row 2 only needs to search rows >= 6.
With this in mind, we just need to loop through the index once (linear complexity, O(n)):
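import pandas as pd
from datetime import datetime

def time_compare(t1, t2):
    # True if t1 is less than 1 second after t2 (both are timestamp strings)
    fmt = '%Y-%m-%d %H:%M:%S.%f'
    return datetime.strptime(t1, fmt).timestamp() - datetime.strptime(t2, fmt).timestamp() < 1

index_j = []
cursor = 0
tmp = list(df.index)  # the index is expected to hold sorted timestamp strings
for i in tmp:
    if cursor >= len(tmp):
        # the cursor already ran past the end, so the last row is the match
        index_j.append(cursor - 1)
        continue
    # advance the cursor while the candidate row is still inside the 1s window
    while time_compare(tmp[cursor], i):
        cursor += 1
        if cursor >= len(tmp):
            break
    index_j.append(cursor - 1)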
Using this df:
>>> df
A B C
1970-01-19 04:28:34.883 0 10 10
1970-01-19 04:28:34.900 1 12 13
1970-01-19 04:28:35.531 2 14 16
1970-01-19 04:28:36.845 3 16 19
1970-01-19 04:28:37.664 4 18 22
1970-01-19 04:28:38.332 5 20 25
1970-01-19 04:28:38.444 6 22 28
1970-01-19 04:28:38.724 7 24 31
1970-01-19 04:28:38.787 8 26 34
1970-01-19 04:28:38.951 9 28 37
>>> index_j
[2, 2, 2, 4, 6, 9, 9, 9, 9, 9]
Using the index:
>>> [tmp[i] for i in index_j]
['1970-01-19 04:28:35.531', '1970-01-19 04:28:35.531', '1970-01-19 04:28:35.531', '1970-01-19 04:28:37.664', '1970-01-19 04:28:38.444', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951']
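The same positions can also be computed without an explicit Python loop. The sketch below is not part of the answer above; it assumes the index is a sorted DatetimeIndex (not strings) and swaps the cursor loop for a binary search with `np.searchsorted`.

```python
import numpy as np
import pandas as pd

stamps = pd.to_datetime([
    "1970-01-19 04:28:34.883", "1970-01-19 04:28:34.900",
    "1970-01-19 04:28:35.531", "1970-01-19 04:28:36.845",
    "1970-01-19 04:28:37.664", "1970-01-19 04:28:38.332",
    "1970-01-19 04:28:38.444", "1970-01-19 04:28:38.724",
    "1970-01-19 04:28:38.787", "1970-01-19 04:28:38.951",
])
df = pd.DataFrame({"A": range(10)}, index=stamps)

values = df.index.values                      # sorted datetime64 array
upper = values + np.timedelta64(1, "s")       # exclusive end of each window

# side="left" counts the timestamps strictly below each upper bound;
# subtracting 1 gives the position of the last row inside the window.
index_j = np.searchsorted(values, upper, side="left") - 1

print(index_j.tolist())  # [2, 2, 2, 4, 6, 9, 9, 9, 9, 9]
matched = df.iloc[index_j]
```

This is O(n log n) rather than O(n), but the work happens inside NumPy, which tends to be faster in practice for large frames.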
I kind of got an answer, so I am sharing it. If anyone has a better answer, you are most welcome to add it.
import datetime
import time
import numpy as np
import pandas as pd
i = pd.to_datetime(np.random.randint(int(time.time()), int(time.time()) + 5000, 10), unit='ms').sort_values()
df = pd.DataFrame({'A': range(10), 'B': range(10, 30, 2), 'C': range(10, 40, 3)}, index=i)
df
A B C
1970-01-19 04:28:30.030 0 10 10
1970-01-19 04:28:30.374 1 12 13
1970-01-19 04:28:31.055 2 14 16
1970-01-19 04:28:32.026 3 16 19
1970-01-19 04:28:32.234 4 18 22
1970-01-19 04:28:32.569 5 20 25
1970-01-19 04:28:32.595 6 22 28
1970-01-19 04:28:33.520 7 24 31
1970-01-19 04:28:33.882 8 26 34
1970-01-19 04:28:34.019 9 28 37
df['time'] = df.index
def last_time(time):
    time = str(time)
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(0, 1)
    # Passing datetime.time objects keeps the fractional seconds.
    # Note that between_time compares only the time of day, not the date.
    tempdf = df.between_time(*pd.to_datetime([str(start_time), str(end_time)]).time).iloc[-1]
    # If the last row in the window is the row itself, there is no later match.
    if str(tempdf['time']) == str(time):
        tempdf.iloc[:] = np.nan
    return tempdf
df.apply(lambda x: last_time(x['time']), axis=1)
A B C time
1970-01-19 04:28:34.883 2.0 14.0 16.0 1970-01-19 04:28:35.531000
1970-01-19 04:28:34.900 2.0 14.0 16.0 1970-01-19 04:28:35.531000
1970-01-19 04:28:35.531 NaN NaN NaN NaN
1970-01-19 04:28:36.845 4.0 18.0 22.0 1970-01-19 04:28:37.664000
1970-01-19 04:28:37.664 6.0 22.0 28.0 1970-01-19 04:28:38.444000
1970-01-19 04:28:38.332 9.0 28.0 37.0 1970-01-19 04:28:38.951000
1970-01-19 04:28:38.444 9.0 28.0 37.0 1970-01-19 04:28:38.951000
1970-01-19 04:28:38.724 9.0 28.0 37.0 1970-01-19 04:28:38.951000
1970-01-19 04:28:38.787 9.0 28.0 37.0 1970-01-19 04:28:38.951000
1970-01-19 04:28:38.951 NaN NaN NaN NaN
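For larger frames, a loop-free variant may be worth trying. The following is only a sketch under assumptions, not the method used above: it relies on `pd.merge_asof` with a sorted DatetimeIndex, treats the window as strictly shorter than 1s, and the column names `start`, `limit` and `ix` are made up for illustration.

```python
import numpy as np
import pandas as pd

stamps = pd.to_datetime([
    "1970-01-19 04:28:34.883", "1970-01-19 04:28:34.900",
    "1970-01-19 04:28:35.531", "1970-01-19 04:28:36.845",
    "1970-01-19 04:28:37.664", "1970-01-19 04:28:38.332",
    "1970-01-19 04:28:38.444", "1970-01-19 04:28:38.724",
    "1970-01-19 04:28:38.787", "1970-01-19 04:28:38.951",
])
df = pd.DataFrame({"A": range(10), "B": range(10, 30, 2),
                   "C": range(10, 40, 3)}, index=stamps)

# Left side: each row's own timestamp and the exclusive end of its window.
left = pd.DataFrame({"start": df.index.values,
                     "limit": df.index.values + np.timedelta64(1, "s")})
# Right side: the original rows with the index exposed as column "ix".
right = df.rename_axis("ix").reset_index()

# For every "limit", pick the last row whose timestamp is strictly earlier.
out = pd.merge_asof(left, right, left_on="limit", right_on="ix",
                    direction="backward", allow_exact_matches=False)

# A row that only matches itself has no later row in its window: blank it.
self_match = out["ix"] == out["start"]
for col in ["ix", "A", "B", "C"]:
    out[col] = out[col].mask(self_match)

out = out.drop(columns="limit").set_index("start")
print(out)
```

On this df the A column comes out as 2, 2, NaN, 4, 6, 9, 9, 9, 9, NaN, matching the result shown above. Unlike between_time, this compares full timestamps, so it does not depend on all rows sharing one date.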