Time series correlation with pandas
I have some Particulate Matter sensors and CSVs with time series like:
Sensor A:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 49
2017-11-30 00:02:00 30/11/17 0.02 51
2017-11-30 00:03:00 30/11/17 0.03 54
2017-11-30 00:05:00 30/11/17 0.05 57
2017-11-30 00:07:00 30/11/17 0.07 53
2017-11-30 00:08:00 30/11/17 0.08 55
2017-11-30 00:10:00 30/11/17 0.10 55
2017-11-30 00:12:00 30/11/17 0.12 58
2017-11-30 00:13:00 30/11/17 0.13 57
2017-11-30 00:15:00 30/11/17 0.15 58
....
2018-02-06 09:30:00 6/2/18 9.30 33
2018-02-06 09:32:00 6/2/18 9.32 31
2018-02-06 09:33:00 6/2/18 9.33 34
2018-02-06 09:35:00 6/2/18 9.35 32
2018-02-06 09:37:00 6/2/18 9.37 33
2018-02-06 09:38:00 6/2/18 9.38 30
I set date as index with:
df.index = pd.to_datetime(df['date'], format='%d/%m/%y %H.%M')
I would like to correlate different time windows of data from the same sensor, and from different sensors over similar time windows. I expect to find out whether there is the same increase/decrease behaviour in some part of the day, across days. After setting the date index I'm able to get all PM values from 9am to 10am, every day, from sensor A:
df.between_time('9:00','10:00')
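For reference, the setup so far can be sketched end to end (the values below are a made-up subset matching the format above; in practice the frame would come from `pd.read_csv`):

```python
import pandas as pd

# Build a small frame matching the sample data (in practice: pd.read_csv(...))
df = pd.DataFrame({
    "date": ["30/11/17 9.05", "30/11/17 9.07", "30/11/17 10.30"],
    "value": [49, 51, 33],
})

# Parse the 'dd/mm/yy H.M' strings and use them as the index
df.index = pd.to_datetime(df["date"], format="%d/%m/%y %H.%M")

# All rows between 9:00 and 10:00, on any day
window = df.between_time("9:00", "10:00")
print(window["value"].tolist())  # [49, 51]
```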
1) Problem 1: how to check correlation for the same sensor on different days. I filtered the 9-10am data from two days into two DataFrames, but the readings are not always taken at exactly the same minute. I may have situations like this:
01-01-2018 (df01 - I removed the date column)
2018-01-01 09:05:00 11
2018-01-01 09:07:00 11
2018-01-01 09:09:00 10
....
02-01-2018 (df02)
2018-02-01 09:05:00 67
2018-02-01 09:07:00 68
2018-02-01 09:08:00 67
....
Should I rename the data column? What I actually care about is that the third value from 01/01/2018 is correlated with the third value from the second window.
df01.corr(df02)
returns
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
2) Problem 2: correlation between different sensors. In this case I have two CSV files with PM values from two sensors. As in Problem 1, I would like to correlate the same time windows from them. Here too I expect some "casual lag" between the data, but errors of a few minutes are fine and I just want to compare the values 'at the right position'. Example:
Sensor A:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 49
2017-11-30 00:02:00 30/11/17 0.02 51
2017-11-30 00:03:00 30/11/17 0.03 54
2017-11-30 00:05:00 30/11/17 0.05 57
Sensor B:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 1
2017-11-30 00:02:00 30/11/17 0.02 40
2017-11-30 00:04:00 30/11/17 0.04 11
2017-11-30 00:05:00 30/11/17 0.05 57
AxB
date valueA valueB
date
2017-11-30 00:00:00 30/11/17 0.00 49 1
2017-11-30 00:02:00 30/11/17 0.02 51 40
2017-11-30 00:03:00 30/11/17 0.03 54 11
2017-11-30 00:05:00 30/11/17 0.05 57 57
Thank you in advance
I'll try to address both of your questions together. This looks like a job for pd.merge_asof(), which merges on nearest-matching keys, rather than only on exact keys.
df1
date value
30/11/17 0.00 51
30/11/17 0.02 53
30/11/17 0.05 65
30/11/17 0.08 58
df2
date value
30/11/17 0.01 61
30/11/17 0.02 63
30/11/17 0.04 65
30/11/17 0.07 68
df1.date = pd.to_datetime(df1.date, format='%d/%m/%y %H.%M')
df2.date = pd.to_datetime(df2.date, format='%d/%m/%y %H.%M')
df1.set_index('date', inplace=True)
df2.set_index('date', inplace=True)
df1
value
date
2017-11-30 00:00:00 51
2017-11-30 00:02:00 53
2017-11-30 00:05:00 65
2017-11-30 00:08:00 58
df2
value
date
2017-11-30 00:01:00 61
2017-11-30 00:02:00 63
2017-11-30 00:04:00 65
2017-11-30 00:07:00 68
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest')
merged
value_x value_y
date
2017-11-30 00:00:00 51 61
2017-11-30 00:02:00 53 63
2017-11-30 00:05:00 65 65
2017-11-30 00:08:00 58 68
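If you want to guard against pairing points that are actually far apart in time, merge_asof also accepts a tolerance keyword; rows with no match inside the window come back as NaN. A sketch with the same data (the 30-second tolerance is just an illustrative choice, deliberately tight so that only the exact match survives):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"value": [51, 53, 65, 58]},
    index=pd.to_datetime(["2017-11-30 00:00", "2017-11-30 00:02",
                          "2017-11-30 00:05", "2017-11-30 00:08"]),
)
df2 = pd.DataFrame(
    {"value": [61, 63, 65, 68]},
    index=pd.to_datetime(["2017-11-30 00:01", "2017-11-30 00:02",
                          "2017-11-30 00:04", "2017-11-30 00:07"]),
)

merged = pd.merge_asof(
    df1, df2,
    left_index=True, right_index=True,
    direction="nearest",
    tolerance=pd.Timedelta("30s"),  # only pair rows at most 30 seconds apart
)
print(merged)  # value_y is NaN except for the exact 00:02 match
```

Note that merge_asof requires both inputs to be sorted by the merge key, which a DatetimeIndex built from chronological readings already is.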
Note that df.corr() doesn't accept data as an argument, so df1.corr(df2) doesn't work; that is the source of your ValueError. The corr method computes pairwise correlation of the columns in the DataFrame you call it on (docs).
merged.corr()
value_x value_y
value_x 1.000000 0.612873
value_y 0.612873 1.000000
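If you only need the single coefficient rather than the full matrix, Series.corr (unlike DataFrame.corr) does accept a second series:

```python
import pandas as pd

# The merged frame from above, rebuilt inline for a self-contained example
merged = pd.DataFrame({
    "value_x": [51, 53, 65, 58],
    "value_y": [61, 63, 65, 68],
})

# Series.corr takes another series; DataFrame.corr does not
r = merged["value_x"].corr(merged["value_y"])
print(round(r, 6))  # 0.612873
```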
The above usage of pd.merge_asof keeps the index of df1; each row in df1 receives its closest match in df2, with replacement, so if df2 ever has fewer rows than df1, the result of merge_asof will contain duplicate values from df2. And the result will have the same number of rows as df1.
You mentioned that you really only care to compare rows by relative position, e.g., compare the 3rd value of df1 to the 3rd value of df2. Instead of using merge_asof, you could simply ignore the time index once you've used it to obtain the time periods of interest, and access the underlying numpy arrays with df.values:
# Get a 2D array of shape (4, 1)
df1.values
array([[51],
[53],
[65],
[58]])
# Get a 1D array of shape (4,)
df1.values.flatten()
array([51, 53, 65, 58])
# numpy correlation matrix (pd.np is deprecated; import numpy directly)
import numpy as np
np.corrcoef(df1.values.flatten(), df2.values.flatten())
array([[1.        , 0.61287265],
       [0.61287265, 1.        ]])
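For Problem 1 (same sensor, different days) the same positional trick applies: take the 9-10am window of each day, truncate to the shorter length, and correlate the flattened arrays. A sketch with invented values shaped like the question's df01/df02:

```python
import numpy as np
import pandas as pd

# Two mornings of readings from one sensor (values invented for illustration)
idx = pd.to_datetime([
    "2018-01-01 09:05", "2018-01-01 09:07", "2018-01-01 09:09",
    "2018-01-02 09:05", "2018-01-02 09:07", "2018-01-02 09:08", "2018-01-02 09:10",
])
df = pd.DataFrame({"value": [11, 11, 10, 67, 68, 67, 69]}, index=idx)

# 9-10am window across all days, then split by day via partial-string .loc
morning = df.between_time("9:00", "10:00")
day1 = morning.loc["2018-01-01", "value"].to_numpy()
day2 = morning.loc["2018-01-02", "value"].to_numpy()

# The windows may differ in length; compare only positions both days have
n = min(len(day1), len(day2))
r = np.corrcoef(day1[:n], day2[:n])[0, 1]
print(r)
```

Because the alignment is purely positional, the minute-level jitter between days no longer matters, which is exactly the "right position" comparison you described.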