Get rows whose timestamps are within irregular time intervals in pandas (time series)

Let's say I have a dataframe like this:

>>> import time
>>> import numpy as np
>>> import pandas as pd
>>> i = pd.to_datetime(np.random.randint(int(time.time()), int(time.time()) + 10000, 15), unit='ms').sort_values()
>>> df = pd.DataFrame({'A': range(15), 'B': range(10, 40, 2), 'C': range(10, 55, 3)}, index=i)
>>> df
                          A   B   C
1970-01-19 05:31:36.629   0  10  10
1970-01-19 05:31:36.710   1  12  13
1970-01-19 05:31:37.779   2  14  16
1970-01-19 05:31:38.761   3  16  19
1970-01-19 05:31:39.520   4  18  22
1970-01-19 05:31:39.852   5  20  25
1970-01-19 05:31:39.994   6  22  28
1970-01-19 05:31:41.370   7  24  31
1970-01-19 05:31:41.667   8  26  34
1970-01-19 05:31:42.515   9  28  37
1970-01-19 05:31:42.941  10  30  40
1970-01-19 05:31:43.037  11  32  43
1970-01-19 05:31:43.253  12  34  46
1970-01-19 05:31:43.333  13  36  49
1970-01-19 05:31:44.135  14  38  52

What I want is:

                            A     B     C
1970-01-19 05:31:37.779   2.0  14.0  16.0   # last value within the 2000 ms interval starting at 05:31:36
1970-01-19 05:31:38.761   3.0  16.0  19.0   ## last value within 1000 ms after the row above
1970-01-19 05:31:39.994   6.0  22.0  28.0   # last value within the 2000 ms interval starting at 05:31:38
1970-01-19 05:31:39.994   6.0  22.0  28.0   *## last value within 1000 ms after the row above
1970-01-19 05:31:41.667   8.0  26.0  34.0   # last value within the 2000 ms interval starting at 05:31:40
1970-01-19 05:31:42.515   9.0  28.0  37.0   ## last value within 1000 ms after the row above
1970-01-19 05:31:43.333  13.0  36.0  49.0   # last value within the 2000 ms interval starting at 05:31:42
1970-01-19 05:31:44.135  14.0  38.0  52.0   ## last value within 1000 ms after the row above

I can get the rows marked with a single # using this code:

>>> df.resample('2000ms').ffill().dropna(axis=0)
                        A     B     C
1970-01-19 05:31:38   2.0  14.0  16.0
1970-01-19 05:31:40   6.0  22.0  28.0
1970-01-19 05:31:42   8.0  26.0  34.0
1970-01-19 05:31:44  13.0  36.0  49.0

Note: I do not care how the timestamps are printed; I just want the correct values.

I can't find a pandas solution that gives me the desired output. I could do this with two dataframes, one resampled at 2000ms and another at 1000ms, and then probably loop over them and insert rows accordingly.

The problem is that my actual data is really large, with over 4,000,000 rows and 52 columns. If it is possible to avoid two dataframes or loops, I would definitely prefer that.

NOTE: The row marked with * is repeated because there is no data within the 1000ms interval after the previous value, so the last seen value is repeated. The same should happen for the 2000ms intervals as well.

If possible, please show me a way. Thanks!

EDIT: Edited as per John Zwinck's comment:

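import datetime

import pandas as pd

def last_time(time):
    # Return the last row of df whose time of day lies within 1 second after `time`.
    time = str(time)
    # Drop the date part ('YYYY-MM-DD ') and parse only the time of day.
    start_time = datetime.datetime.strptime(time[11:], '%H:%M:%S.%f')
    end_time = start_time + datetime.timedelta(microseconds=1000000)
    tempdf = df.between_time(*pd.to_datetime([str(start_time), str(end_time)]).time).iloc[-1]
    return tempdf

df['timestamp'] = df.index
df2 = df.resample('2000ms').ffill().dropna(axis=0)
# Row-by-row apply: one between_time lookup per row of df2.
df3 = df2.apply(lambda x: last_time(x['timestamp']), axis=1)

# 'mergesort' is a stable sort, so each df2 row stays ahead of its df3 row.
pd.concat([df2, df3]).sort_index(kind='mergesort')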

This gives:

                        A     B     C               timestamp
1970-01-19 05:31:38   2.0  14.0  16.0 1970-01-19 05:31:37.779
1970-01-19 05:31:38   3.0  16.0  19.0 1970-01-19 05:31:38.761
1970-01-19 05:31:40   6.0  22.0  28.0 1970-01-19 05:31:39.994
1970-01-19 05:31:40   6.0  22.0  28.0 1970-01-19 05:31:39.994
1970-01-19 05:31:42   8.0  26.0  34.0 1970-01-19 05:31:41.667
1970-01-19 05:31:42   9.0  28.0  37.0 1970-01-19 05:31:42.515
1970-01-19 05:31:44  13.0  36.0  49.0 1970-01-19 05:31:43.333
1970-01-19 05:31:44  14.0  38.0  52.0 1970-01-19 05:31:44.135

This is okay, except that the apply step takes a really long time!


For easier copying:

,A,B,C
1970-01-19 05:31:36.629,0,10,10
1970-01-19 05:31:36.710,1,12,13
1970-01-19 05:31:37.779,2,14,16
1970-01-19 05:31:38.761,3,16,19
1970-01-19 05:31:39.520,4,18,22
1970-01-19 05:31:39.852,5,20,25
1970-01-19 05:31:39.994,6,22,28
1970-01-19 05:31:41.370,7,24,31
1970-01-19 05:31:41.667,8,26,34
1970-01-19 05:31:42.515,9,28,37
1970-01-19 05:31:42.941,10,30,40
1970-01-19 05:31:43.037,11,32,43
1970-01-19 05:31:43.253,12,34,46
1970-01-19 05:31:43.333,13,36,49
1970-01-19 05:31:44.135,14,38,52

The slow part of your existing code is the creation of df3, so I'll optimize that.

First, note that your last_time(x) function looks for the last record within the time range from x to x + 1 second.

Instead of using a loop, we can start by offsetting the times in the entire vector at once:

end_times = df2.timestamp + datetime.timedelta(microseconds=1000000)

Then we can use numpy.searchsorted() (a highly underrated function!) to search for those times in df:

idx = np.searchsorted(df.timestamp, end_times)

Incidentally, df.timestamp.searchsorted(end_times) does the same thing.
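To make the return value of searchsorted concrete, here is a minimal sketch on made-up timestamps (`stamps` and `queries` are hypothetical names and values, not the question's data). For each query it returns the position at which that query would have to be inserted to keep the array sorted; the optional `side` argument decides what happens when a query equals an existing timestamp.

```python
import numpy as np
import pandas as pd

# Hypothetical sorted timestamps, not taken from the question's data.
stamps = pd.to_datetime([
    "2021-01-01 00:00:00.100",
    "2021-01-01 00:00:00.900",
    "2021-01-01 00:00:01.500",
    "2021-01-01 00:00:02.000",
]).to_numpy()
queries = pd.to_datetime([
    "2021-01-01 00:00:01.000",  # falls between two rows
    "2021-01-01 00:00:02.000",  # equals an existing row
]).to_numpy()

# side="left" (the default): position of the first row at or after each query.
left = np.searchsorted(stamps, queries)
# side="right": position of the first row strictly after each query.
right = np.searchsorted(stamps, queries, side="right")

print(left.tolist(), right.tolist())
```

The two results differ only for the query that matches a row exactly. Since between_time includes both endpoints by default, `side="right"` may be the closer match to the question's code if a row can land exactly on the 1-second mark.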

Finally, note that each of those generated indexes is one past the row we want (we don't want the first row at or after the 1-second mark, we want the one just before it):

df3a = df.iloc[idx - 1]
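One edge case is worth checking before relying on `idx - 1`: when an end time is earlier than every row, searchsorted returns 0, and position -1 silently selects the last row instead of failing. The sketch below uses a small made-up frame (`frame`, `has_match`, and `picked` are hypothetical names) to show the wrap-around and one possible guard. In the question's data every end time follows at least one row, so the guard is not needed there.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a sorted DatetimeIndex.
index = pd.to_datetime([
    "2021-01-01 00:00:00.100",
    "2021-01-01 00:00:00.900",
    "2021-01-01 00:00:01.500",
])
frame = pd.DataFrame({"A": [0, 1, 2]}, index=index)

end_times = pd.to_datetime([
    "2021-01-01 00:00:00.050",  # earlier than every row: searchsorted gives 0
    "2021-01-01 00:00:01.000",
]).to_numpy()

idx = np.searchsorted(frame.index.to_numpy(), end_times)

# Unguarded: position -1 wraps around to the last row of the frame.
wrapped = frame.iloc[idx - 1]

# Guarded: keep only end times that have at least one earlier row.
has_match = idx > 0
picked = frame.iloc[idx[has_match] - 1]
```

Whether to drop such end times or fill them with missing values depends on what the downstream code expects; dropping is just the simplest choice for a sketch.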

This gives the same result as your df3, except that it keeps the original timestamps as its index instead of df2's resampled index, so just replace it:

df3a.index = df2.index

This is the same as your df3, but calculated much more quickly.
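Putting the steps together, the following is a sketch of the whole pipeline on the question's sample data. It assumes the index is sorted, and it rebuilds the sample timestamps from millisecond offsets (`offsets_ms` and `result` are names introduced here for illustration). The searchsorted call runs on plain NumPy arrays, which is a choice made for this sketch rather than something the steps require.

```python
import numpy as np
import pandas as pd

# Sample data from the question, written as millisecond offsets from 05:31:00.
offsets_ms = [36629, 36710, 37779, 38761, 39520, 39852, 39994, 41370,
              41667, 42515, 42941, 43037, 43253, 43333, 44135]
index = pd.Timestamp("1970-01-19 05:31:00") + pd.to_timedelta(offsets_ms, unit="ms")
df = pd.DataFrame({"A": range(15), "B": range(10, 40, 2), "C": range(10, 55, 3)},
                  index=index)
df["timestamp"] = df.index

# Last row at or before each 2000 ms boundary (the question's df2).
df2 = df.resample("2000ms").ffill().dropna(axis=0)

# Last row strictly before (df2.timestamp + 1 s), found by binary search.
end_times = (df2["timestamp"] + pd.Timedelta(seconds=1)).to_numpy()
idx = np.searchsorted(df["timestamp"].to_numpy(), end_times)
df3a = df.iloc[idx - 1].copy()
df3a.index = df2.index

# Interleave both results; a stable sort keeps each df2 row ahead of its df3a row.
result = pd.concat([df2, df3a]).sort_index(kind="mergesort")
print(result)
```

On this sample the A column of `result` reads 2, 3, 6, 6, 8, 9, 13, 14, which matches the output shown in the question. The binary search costs roughly one lookup per row of df2 rather than a full between_time scan for each row, which is where the speedup comes from.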
