Get rows whose timestamps are within irregular time intervals pandas (Time Series)
Get rows whose timestamps are within specific sliding window time interval pandas (Time Series)
I have a DataFrame like this:
import time
import numpy as np
import pandas as pd
i = pd.to_datetime(np.random.randint(int(time.time()), int(time.time()) + 5000, 10), unit='ms').sort_values()
df = pd.DataFrame({'A': range(10), 'B': range(10, 30, 2), 'C': range(10, 40, 3)}, index=i)
df
                         A   B   C
1970-01-19 04:28:30.030  0  10  10
1970-01-19 04:28:30.374  1  12  13
1970-01-19 04:28:31.055  2  14  16
1970-01-19 04:28:32.026  3  16  19
1970-01-19 04:28:32.234  4  18  22
1970-01-19 04:28:32.569  5  20  25
1970-01-19 04:28:32.595  6  22  28
1970-01-19 04:28:33.520  7  24  31
1970-01-19 04:28:33.882  8  26  34
1970-01-19 04:28:34.019  9  28  37
What I want is, for each index, the last row that falls within a '1s' interval of that index:
df2
                                              ix    A    B    C
1970-01-19 04:28:30.030  1970-01-19 04:28:30.374    1   12   13
1970-01-19 04:28:30.374  1970-01-19 04:28:31.055    2   14   16
1970-01-19 04:28:31.055  1970-01-19 04:28:32.026    3   16   19
1970-01-19 04:28:32.026  1970-01-19 04:28:32.595    6   22   28
1970-01-19 04:28:32.234  1970-01-19 04:28:32.595    6   22   28
1970-01-19 04:28:32.569  1970-01-19 04:28:33.520    7   24   31
1970-01-19 04:28:32.595  1970-01-19 04:28:33.520    7   24   31
1970-01-19 04:28:33.520  1970-01-19 04:28:34.019    9   28   37
1970-01-19 04:28:33.882  1970-01-19 04:28:34.019    9   28   37
1970-01-19 04:28:34.019                      nan  nan  nan  nan
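I am currently doing this with a loop. At each index I use df.between_time to get all rows in the time interval and then pick the last one. As expected, it is really slow. What I need is a df.shift that works on time. I looked at tshift and shift(periods = 1, freq = 'S'), but they do not behave like shift; they just add the specified amount of time to every index. Can someone help me achieve this? Thanks.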
Note: the ix column in the desired output is optional.
PS: It would be great if a min_periods parameter (as in DataFrame.rolling) were possible!
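To make the difference between the two kinds of shift concrete, here is a minimal sketch on a hypothetical three-row frame (the name `demo` and its values are made up for illustration). It uses `pd.Timedelta(seconds=1)` as the frequency rather than the `'S'` string from the question, and leaves out `tshift`, which is not available in recent pandas versions.

```python
import pandas as pd

# Hypothetical three-row frame with irregular, sub-second timestamps.
idx = pd.to_datetime(["1970-01-19 04:28:30.030",
                      "1970-01-19 04:28:30.374",
                      "1970-01-19 04:28:31.055"])
demo = pd.DataFrame({"A": [0, 1, 2]}, index=idx)

# Positional shift: the index stays put, the values move down by one row.
by_rows = demo.shift(1)

# Shift with a frequency: the values stay put, every index label moves by 1 second.
by_time = demo.shift(periods=1, freq=pd.Timedelta(seconds=1))

print(by_rows)
print(by_time)
```

Neither call looks up "the row one second later", which is why the question cannot be solved with shift alone.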
Edit:
For the starting df:
                         A   B   C
1970-01-19 04:28:34.883  0  10  10
1970-01-19 04:28:34.900  1  12  13
1970-01-19 04:28:35.531  2  14  16
1970-01-19 04:28:36.845  3  16  19
1970-01-19 04:28:37.664  4  18  22
1970-01-19 04:28:38.332  5  20  25
1970-01-19 04:28:38.444  6  22  28
1970-01-19 04:28:38.724  7  24  31
1970-01-19 04:28:38.787  8  26  34
1970-01-19 04:28:38.951  9  28  37
import datetime
df['time'] = df.index
def last_time(time):
    # Parse only the time-of-day part of the timestamp string.
    time = str(time)
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(0, 1)
    # [11:-7] drops the date and the microseconds, so both bounds are whole seconds.
    return df.between_time(start_time=str(start_time)[11:-7],
                           end_time=str(end_time)[11:-7]).iloc[-1]
df.apply(lambda x: last_time(x['time']), axis=1)
# Output:
                         A   B   C                     time
1970-01-19 04:28:34.883  1  12  13  1970-01-19 04:28:34.900
1970-01-19 04:28:34.900  1  12  13  1970-01-19 04:28:34.900
1970-01-19 04:28:35.531  2  14  16  1970-01-19 04:28:35.531
1970-01-19 04:28:36.845  3  16  19  1970-01-19 04:28:36.845
1970-01-19 04:28:37.664  4  18  22  1970-01-19 04:28:37.664
1970-01-19 04:28:38.332  9  28  37  1970-01-19 04:28:38.951
1970-01-19 04:28:38.444  9  28  37  1970-01-19 04:28:38.951
1970-01-19 04:28:38.724  9  28  37  1970-01-19 04:28:38.951
But as you can see, I only get second precision: it looks at the range 34 to 35, so it misses 35.531, which lies within the 1s interval of both 34.883 and 34.900.
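One way to keep sub-second precision in a per-row lookup is to slice by label instead of going through time-of-day strings. The sketch below assumes a sorted DatetimeIndex; `demo` and `last_within` are hypothetical names used only for illustration. It is still a per-row operation, so it addresses the precision problem but not the speed problem.

```python
import pandas as pd

# Hypothetical frame using the first four timestamps from the edit above.
idx = pd.to_datetime(["1970-01-19 04:28:34.883", "1970-01-19 04:28:34.900",
                      "1970-01-19 04:28:35.531", "1970-01-19 04:28:36.845"])
demo = pd.DataFrame({"A": range(4)}, index=idx)

def last_within(frame, t, window=pd.Timedelta(seconds=1)):
    """Return the last row of `frame` whose index lies in [t, t + window]."""
    # Label slicing on a sorted DatetimeIndex keeps the full timestamp
    # precision and includes both endpoints.
    return frame.loc[t:t + window].iloc[-1]

row = last_within(demo, demo.index[0])
print(row.name, row["A"])
```

For the first timestamp (34.883) this picks the 35.531 row, which the whole-second version misses.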
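Assuming your timestamps are sorted, the matching row for row 2 can never come before the matching row for row 1. For example, if row 6 is the match for row 1, then row 2 only needs to search rows >= 6.
With that in mind, we only need to walk through the index once (linear complexity, O(n)):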
import pandas as pd
def time_compare(t1, t2):
    # True while t1 is less than 1 second after t2 (works for strings and Timestamps).
    return pd.Timestamp(t1) - pd.Timestamp(t2) < pd.Timedelta(seconds=1)
index_j = []
cursor = 0  # never moves backwards, so the whole loop is a single pass
tmp = list(df.index)
for i in tmp:
    if cursor >= len(tmp):
        # The cursor already ran off the end: the last row is the match.
        index_j.append(cursor - 1)
        continue
    while time_compare(tmp[cursor], i):
        cursor += 1
        if cursor >= len(tmp):
            break
    # cursor now points at the first row outside the window, so step back one.
    index_j.append(cursor - 1)
Using this df:
>>> df
                         A   B   C
1970-01-19 04:28:34.883  0  10  10
1970-01-19 04:28:34.900  1  12  13
1970-01-19 04:28:35.531  2  14  16
1970-01-19 04:28:36.845  3  16  19
1970-01-19 04:28:37.664  4  18  22
1970-01-19 04:28:38.332  5  20  25
1970-01-19 04:28:38.444  6  22  28
1970-01-19 04:28:38.724  7  24  31
1970-01-19 04:28:38.787  8  26  34
1970-01-19 04:28:38.951  9  28  37
>>> index_j
[2, 2, 2, 4, 6, 9, 9, 9, 9, 9]
Using the indices:
>>> [tmp[i] for i in index_j]
['1970-01-19 04:28:35.531', '1970-01-19 04:28:35.531', '1970-01-19 04:28:35.531', '1970-01-19 04:28:37.664', '1970-01-19 04:28:38.444', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951']
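The same positions can probably be obtained without an explicit Python loop by doing a binary search on the sorted timestamps. This is a sketch rather than part of the answer above: it assumes the index is sorted, swaps the single-pass cursor for `numpy.searchsorted`, and rebuilds the timestamps by hand so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same ten timestamps as in the df above.
idx = pd.to_datetime([
    "1970-01-19 04:28:34.883", "1970-01-19 04:28:34.900",
    "1970-01-19 04:28:35.531", "1970-01-19 04:28:36.845",
    "1970-01-19 04:28:37.664", "1970-01-19 04:28:38.332",
    "1970-01-19 04:28:38.444", "1970-01-19 04:28:38.724",
    "1970-01-19 04:28:38.787", "1970-01-19 04:28:38.951",
])
values = idx.values  # sorted numpy datetime64 array

# For each timestamp t, side="left" gives the position of the first entry
# that is >= t + 1s; the entry just before it is the last one that is
# strictly less than 1 second after t.
positions = np.searchsorted(values, values + np.timedelta64(1, "s"), side="left") - 1

print(positions.tolist())  # [2, 2, 2, 4, 6, 9, 9, 9, 9, 9]
```

The result matches index_j. Like the loop, a row with no later row inside the window maps to itself (position 2 for 35.531).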
I more or less found an answer, so I am sharing it. If anyone has a better one, you are welcome to add it.
import time
import numpy as np
import pandas as pd
i = pd.to_datetime(np.random.randint(int(time.time()), int(time.time()) + 5000, 10), unit='ms').sort_values()
df = pd.DataFrame({'A': range(10), 'B': range(10, 30, 2), 'C': range(10, 40, 3)}, index=i)
df
                         A   B   C
1970-01-19 04:28:30.030  0  10  10
1970-01-19 04:28:30.374  1  12  13
1970-01-19 04:28:31.055  2  14  16
1970-01-19 04:28:32.026  3  16  19
1970-01-19 04:28:32.234  4  18  22
1970-01-19 04:28:32.569  5  20  25
1970-01-19 04:28:32.595  6  22  28
1970-01-19 04:28:33.520  7  24  31
1970-01-19 04:28:33.882  8  26  34
1970-01-19 04:28:34.019  9  28  37
import datetime
df['time'] = df.index
def last_time(time):
    time = str(time)
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(0, 1)
    # Passing datetime.time objects keeps the sub-second part of both bounds.
    tempdf = df.between_time(*pd.to_datetime([str(start_time), str(end_time)]).time).iloc[-1]
    # If the last row in the window is the row itself, there is no later row: return NaN.
    if str(tempdf['time']) == str(time):
        tempdf.iloc[:] = np.nan
    return tempdf
df.apply(lambda x: last_time(x['time']), axis=1)
                           A     B     C                        time
1970-01-19 04:28:34.883  2.0  14.0  16.0  1970-01-19 04:28:35.531000
1970-01-19 04:28:34.900  2.0  14.0  16.0  1970-01-19 04:28:35.531000
1970-01-19 04:28:35.531  NaN   NaN   NaN                         NaN
1970-01-19 04:28:36.845  4.0  18.0  22.0  1970-01-19 04:28:37.664000
1970-01-19 04:28:37.664  6.0  22.0  28.0  1970-01-19 04:28:38.444000
1970-01-19 04:28:38.332  9.0  28.0  37.0  1970-01-19 04:28:38.951000
1970-01-19 04:28:38.444  9.0  28.0  37.0  1970-01-19 04:28:38.951000
1970-01-19 04:28:38.724  9.0  28.0  37.0  1970-01-19 04:28:38.951000
1970-01-19 04:28:38.787  9.0  28.0  37.0  1970-01-19 04:28:38.951000
1970-01-19 04:28:38.951  NaN   NaN   NaN                         NaN
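As a possible vectorized alternative to the row-wise apply above, the sketch below builds the df2 layout from the question, including the optional ix column and NaN for rows that have no later row within 1s. It is not the method used in either answer. It assumes a sorted index with unique timestamps and treats the window as strictly shorter than 1 second; `positions`, `has_later` and `matched` are illustrative names.

```python
import numpy as np
import pandas as pd

# The example frame from the question, rebuilt with fixed timestamps.
idx = pd.to_datetime([
    "1970-01-19 04:28:30.030", "1970-01-19 04:28:30.374",
    "1970-01-19 04:28:31.055", "1970-01-19 04:28:32.026",
    "1970-01-19 04:28:32.234", "1970-01-19 04:28:32.569",
    "1970-01-19 04:28:32.595", "1970-01-19 04:28:33.520",
    "1970-01-19 04:28:33.882", "1970-01-19 04:28:34.019",
])
df = pd.DataFrame({"A": range(10), "B": range(10, 30, 2), "C": range(10, 40, 3)},
                  index=idx)

values = df.index.values
# Position of the last row strictly less than 1 second after each timestamp.
positions = np.searchsorted(values, values + np.timedelta64(1, "s"), side="left") - 1
# A row only has a match if that position lies after the row itself.
has_later = positions > np.arange(len(df))

matched = df.iloc[positions[has_later]].copy()
matched.insert(0, "ix", matched.index)   # timestamp of the matched row
matched.index = df.index[has_later]      # label each match by its originating row
df2 = matched.reindex(df.index)          # rows without a match become NaN/NaT

print(df2)
```

On this data the A column comes out as 1, 2, 3, 6, 6, 7, 7, 9, 9, NaN, in line with the desired output in the question. The numeric columns become floats because of the NaN row.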