Synchronizing and resampling two timeseries with non-uniform millisecond intraday data

Question

I see in the Python documentation the ability to resample and synchronize two timeseries. My problem is harder because there is no time regularity in the timeseries. I read three timeseries that have non-deterministic intraday timestamps. However, in order to do most analysis (covariances, correlations, etc.) on those timeseries, I need them to be of the same length.

In Matlab, given three time series ts1, ts2, ts3 with non-deterministic intraday timestamps, I can synchronize them with:

[ts1, ts2] = synchronize(ts1, ts2, 'union');
[ts1, ts3] = synchronize(ts1, ts3, 'union');
[ts2, ts3] = synchronize(ts2, ts3, 'union');

Note that the time series are already read into pandas DataFrames, so I need to be able to synchronize (and resample?) DataFrames that already exist.

Answer 1

According to the Matlab documentation that you've linked to, it sounds like you want to

Resample timeseries objects using a time vector that is a union of the time vectors of ts1 and ts2 on the time range where the two time vectors overlap.

So first you need to find the union of your dataframes' indices:

newindex = df1.index.union(df2.index)

Then you can recreate your dataframes using this index:

df1 = df1.reindex(newindex)
df2 = df2.reindex(newindex)
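
The Matlab description quoted above restricts the union to the time range where the two series overlap, while a plain reindex on the union keeps every timestamp from both. A minimal sketch of both variants, assuming each dataframe has a sorted DatetimeIndex; the timestamps, the price column and the names start, end, overlap, s1 and s2 are made up for illustration:

```python
import pandas as pd

# Hypothetical irregular millisecond timestamps (illustration only)
idx1 = pd.to_datetime(["2015-11-23 09:30:00.105",
                       "2015-11-23 09:30:00.410",
                       "2015-11-23 09:30:01.020"])
idx2 = pd.to_datetime(["2015-11-23 09:30:00.250",
                       "2015-11-23 09:30:00.410",
                       "2015-11-23 09:30:01.500"])
df1 = pd.DataFrame({"price": [10.0, 10.1, 10.3]}, index=idx1)
df2 = pd.DataFrame({"price": [20.0, 20.2, 20.1]}, index=idx2)

# Union of both indices: every timestamp seen in either series
newindex = df1.index.union(df2.index)

# Optionally keep only the range where both series have data
start = max(df1.index.min(), df2.index.min())
end = min(df1.index.max(), df2.index.max())
overlap = newindex[(newindex >= start) & (newindex <= end)]

# Both frames now share the same index; missing observations are NaN
s1 = df1.reindex(overlap)
s2 = df2.reindex(overlap)
print(s1)
print(s2)
```

After this step s1 and s2 have the same length and the same timestamps, which is what covariance or correlation calculations need.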

Note that both dataframes will have NaNs in all of their new entries (presumably this is the same behaviour as in Matlab). Whether to fill them is up to you: for example, fillna(method='pad') fills null values with the last known value (ffill() does the same thing, and recent pandas versions deprecate the method argument of fillna), or you could use interpolate(method='time') for linear interpolation based on the timestamps.
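
To make the difference between the two fill strategies concrete, here is a small sketch. The data, the column names x and y and the variable names padded, interp1, interp2 and corr are invented for illustration; ffill() is used as the forward-fill call.

```python
import pandas as pd

# Hypothetical series on different irregular timestamps
idx1 = pd.to_datetime(["2015-11-23 09:30:00.000",
                       "2015-11-23 09:30:00.400",
                       "2015-11-23 09:30:01.000"])
idx2 = pd.to_datetime(["2015-11-23 09:30:00.000",
                       "2015-11-23 09:30:00.200",
                       "2015-11-23 09:30:01.000"])
df1 = pd.DataFrame({"x": [1.0, 5.0, 11.0]}, index=idx1)
df2 = pd.DataFrame({"y": [2.0, 4.0, 12.0]}, index=idx2)

newindex = df1.index.union(df2.index)

# Forward fill: carry the last known value into the new rows
padded = df1.reindex(newindex).ffill()

# Time-weighted linear interpolation between known neighbours
interp1 = df1.reindex(newindex).interpolate(method="time")
interp2 = df2.reindex(newindex).interpolate(method="time")

# With a shared index the two columns line up row by row
combined = pd.concat([interp1, interp2], axis=1)
corr = combined["x"].corr(combined["y"])
print(combined)
print(corr)
```

Forward filling never uses future information, which matters for anything resembling a backtest; interpolation gives smoother values but looks ahead to the next observation.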

Answer 2

It is also possible to use a merge to synchronize dataframes. In particular, we might want to synchronize 2 dataframes that carry 2 different data fields and keep both instead of just 1. For example, suppose we have these 3 dataframes with temperature and humidity values to sync:

df1

    company_id            log_date  temperature
0            4 2020-02-29 00:00:00         24.0
1            4 2020-02-29 00:03:00         24.0
2            4 2020-02-29 00:06:00         23.9
3            4 2020-02-29 00:09:00         23.8
4            4 2020-02-29 00:12:00         23.8
5            4 2020-02-29 00:15:00         23.7
6            4 2020-02-29 00:18:00         23.6
7            4 2020-02-29 00:21:00         23.5
8            4 2020-02-29 00:24:00         23.4
9            4 2020-02-29 00:27:00         23.3
10           4 2020-02-29 00:30:00         24.0
11           4 2020-02-29 00:33:00         21.0
12           4 2020-02-29 00:36:00         22.9
13           4 2020-02-29 00:39:00         23.8
14           4 2020-02-29 00:42:00         22.8
15           4 2020-02-29 00:45:00         21.7
16           4 2020-02-29 00:48:00         22.6
17           4 2020-02-29 00:51:00         21.5

df2

   company_id            log_date  humidity
0           4 2020-02-29 00:00:00     74.92
1           4 2020-02-29 00:05:00     75.00
2           4 2020-02-29 00:10:00     73.10
3           4 2020-02-29 00:15:00     72.10
4           4 2020-02-29 00:20:00     72.00
5           4 2020-02-29 00:25:00     73.00
6           4 2020-02-29 00:30:00     74.00
7           4 2020-02-29 00:35:00     72.10
8           4 2020-02-29 00:45:00     69.00
9           4 2020-02-29 00:50:00     71.92

df3

   company_id            log_date  temperature
0           4 2020-02-29 00:00:00        20.00
1           4 2020-02-29 00:05:00        21.00
2           4 2020-02-29 00:10:00        22.00
3           4 2020-02-29 00:15:00        23.00
4           4 2020-02-29 00:20:00        23.10
5           4 2020-02-29 00:25:00        22.00
6           4 2020-02-29 00:30:00        22.00
7           4 2020-02-29 00:35:00        22.10
8           4 2020-02-29 00:45:00        23.00
9           4 2020-02-29 00:50:00        21.92

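We can do something like this:

import pandas as pd

# merge_asof needs datetime keys, sorted in ascending order
df1['log_date'] = pd.to_datetime(df1['log_date'])
df2['log_date'] = pd.to_datetime(df2['log_date'])
df3['log_date'] = pd.to_datetime(df3['log_date'])

# For each df1 row, take the latest df2/df3 row at or before log_date,
# within the same company_id and at most 5 minutes older
df_a = pd.merge_asof(df1, df2, on="log_date", by="company_id", tolerance=pd.Timedelta("5m"))
df_b = pd.merge_asof(df1, df3, on="log_date", by="company_id", tolerance=pd.Timedelta("5m"))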

And the resulting dataframes:

df_a

    company_id            log_date  temperature  humidity
0            4 2020-02-29 00:00:00         24.0     74.92
1            4 2020-02-29 00:03:00         24.0     74.92
2            4 2020-02-29 00:06:00         23.9     75.00
3            4 2020-02-29 00:09:00         23.8     75.00
4            4 2020-02-29 00:12:00         23.8     73.10
5            4 2020-02-29 00:15:00         23.7     72.10
6            4 2020-02-29 00:18:00         23.6     72.10
7            4 2020-02-29 00:21:00         23.5     72.00
8            4 2020-02-29 00:24:00         23.4     72.00
9            4 2020-02-29 00:27:00         23.3     73.00
10           4 2020-02-29 00:30:00         24.0     74.00
11           4 2020-02-29 00:33:00         21.0     74.00
12           4 2020-02-29 00:36:00         22.9     72.10
13           4 2020-02-29 00:39:00         23.8     72.10
14           4 2020-02-29 00:42:00         22.8       NaN
15           4 2020-02-29 00:45:00         21.7     69.00
16           4 2020-02-29 00:48:00         22.6     69.00
17           4 2020-02-29 00:51:00         21.5     71.92

df_b

    company_id            log_date  temperature_x  temperature_y
0            4 2020-02-29 00:00:00           24.0          20.00
1            4 2020-02-29 00:03:00           24.0          20.00
2            4 2020-02-29 00:06:00           23.9          21.00
3            4 2020-02-29 00:09:00           23.8          21.00
4            4 2020-02-29 00:12:00           23.8          22.00
5            4 2020-02-29 00:15:00           23.7          23.00
6            4 2020-02-29 00:18:00           23.6          23.00
7            4 2020-02-29 00:21:00           23.5          23.10
8            4 2020-02-29 00:24:00           23.4          23.10
9            4 2020-02-29 00:27:00           23.3          22.00
10           4 2020-02-29 00:30:00           24.0          22.00
11           4 2020-02-29 00:33:00           21.0          22.00
12           4 2020-02-29 00:36:00           22.9          22.10
13           4 2020-02-29 00:39:00           23.8          22.10
14           4 2020-02-29 00:42:00           22.8            NaN
15           4 2020-02-29 00:45:00           21.7          23.00
16           4 2020-02-29 00:48:00           22.6          23.00
17           4 2020-02-29 00:51:00           21.5          21.92

In the first we have 2 different data fields, temperature and humidity, and in the second we have 2 different versions of temperature (temperature_x from df1 and temperature_y from df3). Note the NaN at 00:42: the most recent matching row is from 00:35, which is outside the 5 minute tolerance. This could be what you are trying to achieve.
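
The same approach should carry over to millisecond data like that in the question. A rough sketch, assuming two frames with a shared time column; the names trades, quotes, price and bid and all values are hypothetical. It also shows the direction parameter of merge_asof, which defaults to "backward":

```python
import pandas as pd

# Hypothetical irregular millisecond data (illustration only)
trades = pd.DataFrame({
    "time": pd.to_datetime(["2020-02-29 09:30:00.105",
                            "2020-02-29 09:30:00.410",
                            "2020-02-29 09:30:01.020"]),
    "price": [10.0, 10.1, 10.3],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2020-02-29 09:30:00.100",
                            "2020-02-29 09:30:00.400",
                            "2020-02-29 09:30:00.700",
                            "2020-02-29 09:30:01.030"]),
    "bid": [9.9, 10.0, 10.2, 10.25],
})

tol = pd.Timedelta("50ms")

# Default direction="backward": last quote at or before each trade
backward = pd.merge_asof(trades, quotes, on="time", tolerance=tol)

# direction="nearest": closest quote on either side of each trade
nearest = pd.merge_asof(trades, quotes, on="time", tolerance=tol,
                        direction="nearest")
print(backward)
print(nearest)
```

In the backward merge the last trade gets NaN, because the latest earlier quote is 320 ms old; with direction="nearest" it picks up the quote that arrives 10 ms later. Which of the two is appropriate depends on whether looking ahead is acceptable for your analysis.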

Answer 1: score 6, accepted, 2015-11-23 22:15:11

Answer 2: score 0, 2021-06-02 22:54:46
