Get rows whose timestamp falls within a specific sliding window time interval in pandas (time series)

Question

I have a dataframe like this:

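import time
import numpy as np
import pandas as pd

now = int(time.time())  # randint expects integer bounds
i = pd.to_datetime(np.random.randint(now, now + 5000, 10), unit='ms').sort_values()
df = pd.DataFrame({'A': range(10), 'B': range(10, 30, 2), 'C': range(10, 40, 3)}, index=i)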

df
                         A   B   C
1970-01-19 04:28:30.030  0  10  10
1970-01-19 04:28:30.374  1  12  13
1970-01-19 04:28:31.055  2  14  16
1970-01-19 04:28:32.026  3  16  19
1970-01-19 04:28:32.234  4  18  22
1970-01-19 04:28:32.569  5  20  25
1970-01-19 04:28:32.595  6  22  28
1970-01-19 04:28:33.520  7  24  31
1970-01-19 04:28:33.882  8  26  34
1970-01-19 04:28:34.019  9  28  37

What I want is, for each index, the last row that falls within a '1s' interval after that index:

df2
                                              ix    A    B    C
1970-01-19 04:28:30.030  1970-01-19 04:28:30.374    1   12   13
1970-01-19 04:28:30.374  1970-01-19 04:28:31.055    2   14   16
1970-01-19 04:28:31.055  1970-01-19 04:28:32.026    3   16   19
1970-01-19 04:28:32.026  1970-01-19 04:28:32.595    6   22   28
1970-01-19 04:28:32.234  1970-01-19 04:28:32.595    6   22   28
1970-01-19 04:28:32.569  1970-01-19 04:28:33.520    7   24   31
1970-01-19 04:28:32.595  1970-01-19 04:28:33.520    7   24   31
1970-01-19 04:28:33.520  1970-01-19 04:28:34.019    9   28   37
1970-01-19 04:28:33.882  1970-01-19 04:28:34.019    9   28   37
1970-01-19 04:28:34.019                      NaN  NaN  NaN  NaN

I am currently doing this with loops. At each index I use df.between_time to get all the rows in the time interval and then select the last row. As expected, it is really slow. I need something like df.shift, but for time. I looked at tshift and shift(periods=1, freq='S'), but they do not work like shift: instead of moving the rows, they add the specified time to each index. Can somebody help me achieve this? Thanks.
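To illustrate the shift behavior described above, here is a minimal sketch. The three-row frame `demo` is made up for illustration, and the frequency is passed as a `pd.Timedelta` rather than the 'S' alias. The values stay on their rows while every index label moves forward by one second.

```python
import pandas as pd

# Hypothetical three-row frame, used only to show what shift(freq=...) does
idx = pd.to_datetime([
    "1970-01-19 04:28:30.030",
    "1970-01-19 04:28:30.374",
    "1970-01-19 04:28:31.055",
])
demo = pd.DataFrame({"A": [0, 1, 2]}, index=idx)

# With freq given, the data is left alone and the index is moved instead
shifted = demo.shift(periods=1, freq=pd.Timedelta(seconds=1))
print(shifted)
```

So this relabels the rows rather than looking up which later row falls inside each one-second window.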

Note: The ix column in the desired output is optional.

PS: If a min_periods parameter (like the one in DataFrame.rolling) is possible, that would be great!

EDIT:

For a starting df:

                         A   B   C
1970-01-19 04:28:34.883  0  10  10
1970-01-19 04:28:34.900  1  12  13
1970-01-19 04:28:35.531  2  14  16
1970-01-19 04:28:36.845  3  16  19
1970-01-19 04:28:37.664  4  18  22
1970-01-19 04:28:38.332  5  20  25
1970-01-19 04:28:38.444  6  22  28
1970-01-19 04:28:38.724  7  24  31
1970-01-19 04:28:38.787  8  26  34
1970-01-19 04:28:38.951  9  28  37

import datetime

df['time'] = df.index
def last_time(time):
    time = str(time)
    # Parse only the time-of-day part of the timestamp
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(0, 1)  # one second later
    # [11:-7] keeps 'HH:MM:SS' and drops the fractional seconds
    return df.between_time(start_time=str(start_time)[11:-7],
                           end_time=str(end_time)[11:-7]).iloc[-1]
df.apply(lambda x: last_time(x['time']), axis=1)

# Output:
                         A   B   C                    time
1970-01-19 04:28:34.883  1  12  13 1970-01-19 04:28:34.900
1970-01-19 04:28:34.900  1  12  13 1970-01-19 04:28:34.900
1970-01-19 04:28:35.531  2  14  16 1970-01-19 04:28:35.531
1970-01-19 04:28:36.845  3  16  19 1970-01-19 04:28:36.845
1970-01-19 04:28:37.664  4  18  22 1970-01-19 04:28:37.664
1970-01-19 04:28:38.332  9  28  37 1970-01-19 04:28:38.951
1970-01-19 04:28:38.444  9  28  37 1970-01-19 04:28:38.951
1970-01-19 04:28:38.724  9  28  37 1970-01-19 04:28:38.951

But as you can see, I only get second-level precision. For 34.883, for example, it considers the range 34 to 35, so it misses 35.531, which is within the 1s interval of both 34.883 and 34.900.
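The loss of precision can be seen in isolation with a small sketch. It assumes the timestamp prints as '1970-01-19 04:28:34.883000', and the variable names are illustrative only. The [11:-7] slice removes the '.883000' suffix, so the bounds passed to between_time are whole seconds.

```python
import datetime

stamp = "1970-01-19 04:28:34.883000"  # assumed string form of the first index value
start_time = datetime.datetime.strptime(stamp[11:], "%H:%M:%S.%f")
end_time = start_time + datetime.timedelta(seconds=1)

# Slicing off the last 7 characters drops the fractional part
truncated_start = str(start_time)[11:-7]
truncated_end = str(end_time)[11:-7]
print(truncated_start, truncated_end)

# The time objects themselves still carry the microseconds
print(start_time.time(), end_time.time())
```

Passing bounds that keep the microseconds, rather than the truncated strings, should avoid this rounding.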

Answer 1

Assuming your timestamps are sorted, the matching row for row 2 can never come before the matching row for row 1. For example, if row 6 is the match for row 1, then row 2 only needs to search rows >= 6.

With this in mind, we only need to loop through the index once (linear complexity, O(n)):

import pandas as pd

def time_compare(t1, t2):
    # True while t1 is less than one second after t2.
    # pd.Timestamp accepts both Timestamp objects and timestamp strings.
    return pd.Timestamp(t1) - pd.Timestamp(t2) < pd.Timedelta(seconds=1)

index_j = []
cursor = 0
tmp = list(df.index)
for i in tmp:
    # The cursor never moves backwards, so each row is visited once overall
    while cursor < len(tmp) and time_compare(tmp[cursor], i):
        cursor += 1
    # cursor now points just past the last row inside the window
    index_j.append(cursor - 1)

Using this df:

>>> df
                         A   B   C
1970-01-19 04:28:34.883  0  10  10
1970-01-19 04:28:34.900  1  12  13
1970-01-19 04:28:35.531  2  14  16
1970-01-19 04:28:36.845  3  16  19
1970-01-19 04:28:37.664  4  18  22
1970-01-19 04:28:38.332  5  20  25
1970-01-19 04:28:38.444  6  22  28
1970-01-19 04:28:38.724  7  24  31
1970-01-19 04:28:38.787  8  26  34
1970-01-19 04:28:38.951  9  28  37



>>> index_j
[2, 2, 2, 4, 6, 9, 9, 9, 9, 9]

Using the index:

>>> [tmp[i] for i in index_j]
['1970-01-19 04:28:35.531', '1970-01-19 04:28:35.531', '1970-01-19 04:28:35.531', '1970-01-19 04:28:37.664', '1970-01-19 04:28:38.444', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951', '1970-01-19 04:28:38.951']
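Because the index is sorted, the same lookup can also be sketched without a Python loop by using a binary search. This is an alternative to the cursor loop, not part of the approach above, and the names `window`, `pos` and `has_later` are illustrative. `np.searchsorted` with side="left" finds the first timestamp at or beyond t + 1s, so subtracting 1 gives the last row strictly inside the window.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime([
    "1970-01-19 04:28:34.883", "1970-01-19 04:28:34.900",
    "1970-01-19 04:28:35.531", "1970-01-19 04:28:36.845",
    "1970-01-19 04:28:37.664", "1970-01-19 04:28:38.332",
    "1970-01-19 04:28:38.444", "1970-01-19 04:28:38.724",
    "1970-01-19 04:28:38.787", "1970-01-19 04:28:38.951",
])
window = pd.Timedelta(seconds=1)

# First position whose timestamp is >= t + window, minus one
pos = np.searchsorted(idx.values, (idx + window).values, side="left") - 1
print(pos.tolist())

# Rows whose only match is themselves have no later row inside the window
has_later = pos > np.arange(len(idx))
ix = pd.Series(idx[pos], index=idx).where(has_later)
print(ix)
```

The `has_later` mask is one way to get the missing values from the question's desired output: rows that only match themselves become NaT instead of pointing back at their own timestamp.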

Answer 2

I found an answer of sorts, so I am sharing it. If anyone has a better answer, you are most welcome to add it.

import time
import numpy as np
import pandas as pd

now = int(time.time())  # randint expects integer bounds
i = pd.to_datetime(np.random.randint(now, now + 5000, 10), unit='ms').sort_values()
df = pd.DataFrame({'A': range(10), 'B': range(10, 30, 2), 'C': range(10, 40, 3)}, index=i)
df
                         A   B   C
1970-01-19 04:28:30.030  0  10  10
1970-01-19 04:28:30.374  1  12  13
1970-01-19 04:28:31.055  2  14  16
1970-01-19 04:28:32.026  3  16  19
1970-01-19 04:28:32.234  4  18  22
1970-01-19 04:28:32.569  5  20  25
1970-01-19 04:28:32.595  6  22  28
1970-01-19 04:28:33.520  7  24  31
1970-01-19 04:28:33.882  8  26  34
1970-01-19 04:28:34.019  9  28  37

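import datetime

df['time'] = df.index
def last_time(time):
    time = str(time)
    # Parse the time-of-day part, keeping the fractional seconds
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(0, 1)  # one second later
    # Pass datetime.time objects so between_time keeps sub-second precision
    tempdf = df.between_time(*pd.to_datetime([str(start_time), str(end_time)]).time).iloc[-1]
    # If the last row in the window is the row itself, there is no later match
    if str(tempdf['time']) == str(time):
        tempdf.iloc[:] = np.nan
    return tempdf
df.apply(lambda x: last_time(x['time']), axis=1)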

                           A     B     C                        time
1970-01-19 04:28:34.883  2.0  14.0  16.0  1970-01-19 04:28:35.531000
1970-01-19 04:28:34.900  2.0  14.0  16.0  1970-01-19 04:28:35.531000
1970-01-19 04:28:35.531  NaN   NaN   NaN                         NaN
1970-01-19 04:28:36.845  4.0  18.0  22.0  1970-01-19 04:28:37.664000
1970-01-19 04:28:37.664  6.0  22.0  28.0  1970-01-19 04:28:38.444000
1970-01-19 04:28:38.332  9.0  28.0  37.0  1970-01-19 04:28:38.951000
1970-01-19 04:28:38.444  9.0  28.0  37.0  1970-01-19 04:28:38.951000
1970-01-19 04:28:38.724  9.0  28.0  37.0  1970-01-19 04:28:38.951000
1970-01-19 04:28:38.787  9.0  28.0  37.0  1970-01-19 04:28:38.951000
1970-01-19 04:28:38.951  NaN   NaN   NaN                         NaN

2 answers

Answer 1 | 1 | 2019-10-17 14:15:21

Answer 2 | 0 | accepted | 2019-10-17 12:43:44
