
Synchronizing and Resampling two timeseries with non-uniform millisecond intraday data

I see in the Python documentation the ability to resample and synchronize two timeseries. My problem is harder because there is no time regularity in the timeseries. I read three timeseries that have non-deterministic intraday timestamps. However, in order to do most analysis (covariances, correlations, etc.) on those timeseries, I need them to be of the same length.

In Matlab, given three time series ts1, ts2, ts3 with non-deterministic intraday timestamps, I can synchronize them by saying

[ts1, ts2] = synchronize(ts1, ts2, 'union');
[ts1, ts3] = synchronize(ts1, ts3, 'union');
[ts2, ts3] = synchronize(ts2, ts3, 'union');

Note that the time series are already read into a pandas DataFrame, so I need to be able to synchronize (and resample?) with already created DataFrames.

According to the Matlab documentation that you've linked to, it sounds like you want to

Resample timeseries objects using a time vector that is a union of the time vectors of ts1 and ts2 on the time range where the two time vectors overlap.

So first you need to find the union of your dataframes' indices:

newindex = df1.index.union(df2.index)

Then you can recreate your dataframes using this index:

df1 = df1.reindex(newindex)
df2 = df2.reindex(newindex)

Note that both dataframes will have NaNs in all of their new entries (presumably this is the same behaviour as in Matlab); it's up to you whether to fill these. For example, fillna(method='pad') fills null values using the last known value (recent pandas versions deprecate this spelling in favour of ffill()), or you could use interpolate(method='time') for linear interpolation based on the timestamps.
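The steps above can be put together in a small sketch. The two frames, their `price` column, and the millisecond timestamps below are made-up illustration data, not taken from the question. Trimming to the overlapping time range is an assumption meant to mimic the Matlab description quoted above. Interpolation is done on the full union first and the result is trimmed afterwards, so that points just outside the overlap can still serve as interpolation neighbours.

```python
import pandas as pd

# Hypothetical irregular, millisecond-stamped series (illustrative data only).
idx1 = pd.to_datetime([
    "2020-02-29 09:30:00.120",
    "2020-02-29 09:30:00.480",
    "2020-02-29 09:30:01.050",
    "2020-02-29 09:30:01.900",
])
idx2 = pd.to_datetime([
    "2020-02-29 09:30:00.300",
    "2020-02-29 09:30:00.480",
    "2020-02-29 09:30:01.400",
    "2020-02-29 09:30:02.250",
])
df1 = pd.DataFrame({"price": [10.0, 10.2, 10.1, 10.4]}, index=idx1)
df2 = pd.DataFrame({"price": [20.0, 20.1, 20.3, 20.2]}, index=idx2)

# Union of both indices: sorted, with shared timestamps kept once.
newindex = df1.index.union(df2.index)

# Time range covered by both series.
start = max(df1.index.min(), df2.index.min())
end = min(df1.index.max(), df2.index.max())

# Reindex onto the union, interpolate linearly in time, then trim to the overlap.
s1 = df1.reindex(newindex).interpolate(method="time").loc[start:end]
s2 = df2.reindex(newindex).interpolate(method="time").loc[start:end]

# Both frames now share one index, so pairwise statistics are well defined.
correlation = s1["price"].corr(s2["price"])
```

With a third dataframe, the same idea should carry over by taking the union of all three indices before reindexing.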

It is also possible to use a merge to synchronize dataframes. In particular, we might want to synchronize 2 dataframes that carry 2 different data fields and keep both fields instead of 1. For example, suppose we have these 3 dataframes with temperature and humidity values to sync:

df1

    company_id            log_date  temperature
0            4 2020-02-29 00:00:00         24.0
1            4 2020-02-29 00:03:00         24.0
2            4 2020-02-29 00:06:00         23.9
3            4 2020-02-29 00:09:00         23.8
4            4 2020-02-29 00:12:00         23.8
5            4 2020-02-29 00:15:00         23.7
6            4 2020-02-29 00:18:00         23.6
7            4 2020-02-29 00:21:00         23.5
8            4 2020-02-29 00:24:00         23.4
9            4 2020-02-29 00:27:00         23.3
10           4 2020-02-29 00:30:00         24.0
11           4 2020-02-29 00:33:00         21.0
12           4 2020-02-29 00:36:00         22.9
13           4 2020-02-29 00:39:00         23.8
14           4 2020-02-29 00:42:00         22.8
15           4 2020-02-29 00:45:00         21.7
16           4 2020-02-29 00:48:00         22.6
17           4 2020-02-29 00:51:00         21.5

df2

   company_id            log_date  humidity
0           4 2020-02-29 00:00:00     74.92
1           4 2020-02-29 00:05:00     75.00
2           4 2020-02-29 00:10:00     73.10
3           4 2020-02-29 00:15:00     72.10
4           4 2020-02-29 00:20:00     72.00
5           4 2020-02-29 00:25:00     73.00
6           4 2020-02-29 00:30:00     74.00
7           4 2020-02-29 00:35:00     72.10
8           4 2020-02-29 00:45:00     69.00
9           4 2020-02-29 00:50:00     71.92

df3

   company_id            log_date  temperature
0           4 2020-02-29 00:00:00        20.00
1           4 2020-02-29 00:05:00        21.00
2           4 2020-02-29 00:10:00        22.00
3           4 2020-02-29 00:15:00        23.00
4           4 2020-02-29 00:20:00        23.10
5           4 2020-02-29 00:25:00        22.00
6           4 2020-02-29 00:30:00        22.00
7           4 2020-02-29 00:35:00        22.10
8           4 2020-02-29 00:45:00        23.00
9           4 2020-02-29 00:50:00        21.92

We can do something like this:

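import pandas as pd

# merge_asof needs a sorted datetime key, so make sure log_date is a real datetime
df1['log_date'] = pd.to_datetime(df1['log_date'])
df2['log_date'] = pd.to_datetime(df2['log_date'])
df3['log_date'] = pd.to_datetime(df3['log_date'])

# For each df1 row, attach the latest earlier-or-equal row (at most 5 minutes old) of the same company
df_a = pd.merge_asof(df1, df2, on="log_date", by="company_id", tolerance=pd.Timedelta("5m"))
df_b = pd.merge_asof(df1, df3, on="log_date", by="company_id", tolerance=pd.Timedelta("5m"))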

And the resulting dataframes:

df_a

    company_id            log_date  temperature  humidity
0            4 2020-02-29 00:00:00         24.0     74.92
1            4 2020-02-29 00:03:00         24.0     74.92
2            4 2020-02-29 00:06:00         23.9     75.00
3            4 2020-02-29 00:09:00         23.8     75.00
4            4 2020-02-29 00:12:00         23.8     73.10
5            4 2020-02-29 00:15:00         23.7     72.10
6            4 2020-02-29 00:18:00         23.6     72.10
7            4 2020-02-29 00:21:00         23.5     72.00
8            4 2020-02-29 00:24:00         23.4     72.00
9            4 2020-02-29 00:27:00         23.3     73.00
10           4 2020-02-29 00:30:00         24.0     74.00
11           4 2020-02-29 00:33:00         21.0     74.00
12           4 2020-02-29 00:36:00         22.9     72.10
13           4 2020-02-29 00:39:00         23.8     72.10
14           4 2020-02-29 00:42:00         22.8       NaN
15           4 2020-02-29 00:45:00         21.7     69.00
16           4 2020-02-29 00:48:00         22.6     69.00
17           4 2020-02-29 00:51:00         21.5     71.92

df_b

    company_id            log_date  temperature_x  temperature_y
0            4 2020-02-29 00:00:00           24.0          20.00
1            4 2020-02-29 00:03:00           24.0          20.00
2            4 2020-02-29 00:06:00           23.9          21.00
3            4 2020-02-29 00:09:00           23.8          21.00
4            4 2020-02-29 00:12:00           23.8          22.00
5            4 2020-02-29 00:15:00           23.7          23.00
6            4 2020-02-29 00:18:00           23.6          23.00
7            4 2020-02-29 00:21:00           23.5          23.10
8            4 2020-02-29 00:24:00           23.4          23.10
9            4 2020-02-29 00:27:00           23.3          22.00
10           4 2020-02-29 00:30:00           24.0          22.00
11           4 2020-02-29 00:33:00           21.0          22.00
12           4 2020-02-29 00:36:00           22.9          22.10
13           4 2020-02-29 00:39:00           23.8          22.10
14           4 2020-02-29 00:42:00           22.8            NaN
15           4 2020-02-29 00:45:00           21.7          23.00
16           4 2020-02-29 00:48:00           22.6          23.00
17           4 2020-02-29 00:51:00           21.5          21.92

In the first we have 2 different data fields, temperature and humidity, and in the second we have 2 different versions of temperature. This could be something you are trying to achieve.
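For millisecond-level data like that in the question, the same merge_asof approach can match each row to the nearest timestamp on either side rather than only the most recent one. The sketch below assumes hypothetical frames with a `ts` key and `bid` / `ask` columns, and a 100 ms tolerance chosen purely for illustration.

```python
import pandas as pd

# Hypothetical millisecond-stamped frames (illustrative data only).
left = pd.DataFrame({
    "ts": pd.to_datetime([
        "2020-02-29 09:30:00.100",
        "2020-02-29 09:30:00.450",
        "2020-02-29 09:30:01.200",
    ]),
    "bid": [10.0, 10.1, 10.3],
})
right = pd.DataFrame({
    "ts": pd.to_datetime([
        "2020-02-29 09:30:00.090",
        "2020-02-29 09:30:00.500",
        "2020-02-29 09:30:02.000",
    ]),
    "ask": [10.2, 10.4, 10.6],
})

# Match each left row to the closest right row in time, before or after,
# but only if it lies within 100 ms; otherwise the right-hand value is NaN.
synced = pd.merge_asof(
    left,
    right,
    on="ts",
    direction="nearest",
    tolerance=pd.Timedelta("100ms"),
)

# Drop unmatched rows so both columns have the same length for analysis.
aligned = synced.dropna()
```

Here the third `left` row has no `right` row within 100 ms, so it gets NaN and is dropped from `aligned`. Whether to drop, pad, or interpolate such rows depends on the analysis.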


 