
Time series correlation with pandas

I have some Particulate Matter sensors and CSVs with time series like:

Sensor A:

                     date           value
date                                     
2017-11-30 00:00:00  30/11/17 0.00     49
2017-11-30 00:02:00  30/11/17 0.02     51
2017-11-30 00:03:00  30/11/17 0.03     54
2017-11-30 00:05:00  30/11/17 0.05     57
2017-11-30 00:07:00  30/11/17 0.07     53
2017-11-30 00:08:00  30/11/17 0.08     55
2017-11-30 00:10:00  30/11/17 0.10     55
2017-11-30 00:12:00  30/11/17 0.12     58
2017-11-30 00:13:00  30/11/17 0.13     57
2017-11-30 00:15:00  30/11/17 0.15     58
....
2018-02-06 09:30:00    6/2/18 9.30     33
2018-02-06 09:32:00    6/2/18 9.32     31
2018-02-06 09:33:00    6/2/18 9.33     34
2018-02-06 09:35:00    6/2/18 9.35     32
2018-02-06 09:37:00    6/2/18 9.37     33
2018-02-06 09:38:00    6/2/18 9.38     30

I set date as index with:

df.index = pd.to_datetime(df['date'], format='%d/%m/%y %H.%M')

I would like to correlate different time windows of data from the same sensor, and similar time windows across different sensors. I want to know whether they show the same increase/decrease behaviour in some part of the day (or across days). After setting the date index I can get all PM values from 9am to 10am every day from sensor A:

df.between_time('9:00','10:00')
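To make that step concrete, here is a minimal, self-contained sketch with made-up readings (not the real sensor CSV):

```python
import pandas as pd

# Hypothetical readings around the 9-10am window
df = pd.DataFrame(
    {"value": [49, 51, 54, 57]},
    index=pd.to_datetime(
        ["2017-11-30 08:55", "2017-11-30 09:05",
         "2017-11-30 09:45", "2017-11-30 10:10"]
    ),
)

# between_time keeps rows whose time-of-day falls inside the window,
# regardless of the calendar date
morning = df.between_time("9:00", "10:00")
```

Only the 09:05 and 09:45 rows survive the filter; note that `between_time` requires a DatetimeIndex.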

1) Problem 1: How to check correlation from the same sensor on different days. I filtered the 9-10am data from two days into two DataFrames, but the readings are not always taken at exactly the same minute. I may have situations like this:

01-01-2018 (df01 - I removed the date column)
2018-01-01 09:05:00     11
2018-01-01 09:07:00     11
2018-01-01 09:09:00     10
....


02-01-2018 (df02)
2018-02-01 09:05:00     67
2018-02-01 09:07:00     68
2018-02-01 09:08:00     67
....

Should I rename the data column? What I actually care about is that the third value from 01/01/2018 correlates with the third value from the second window.

df01.corr(df02)

returns

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

2) Problem 2: How to correlate between different sensors. In this case I have 2 CSV files with PM values from two sensors. As in Problem 1, I would like to correlate the same time windows from both. Here too I expect some occasional lag between the data, but errors of a few minutes are fine; I just want to compare values 'at the right position'. Example:

Sensor A:
                         date           value
    date                                     
    2017-11-30 00:00:00  30/11/17 0.00     49
    2017-11-30 00:02:00  30/11/17 0.02     51
    2017-11-30 00:03:00  30/11/17 0.03     54
    2017-11-30 00:05:00  30/11/17 0.05     57

Sensor B:
                         date           value
    date                                     
    2017-11-30 00:00:00  30/11/17 0.00     1
    2017-11-30 00:02:00  30/11/17 0.02     40
    2017-11-30 00:04:00  30/11/17 0.03     11
    2017-11-30 00:05:00  30/11/17 0.05     57

AxB
                         date           valueA    valueB
    date                                     
    2017-11-30 00:00:00  30/11/17 0.00     49       1
    2017-11-30 00:02:00  30/11/17 0.02     51       40
    2017-11-30 00:03:00  30/11/17 0.03     54       11
    2017-11-30 00:05:00  30/11/17 0.05     57       57

Thank you in advance

I'll try to address both of your questions together. This looks like a job for pd.merge_asof(), which merges on nearest-matching keys rather than only on exact keys.

Example data

df1
date            value
30/11/17 0.00   51
30/11/17 0.02   53
30/11/17 0.05   65
30/11/17 0.08   58

df2
date            value
30/11/17 0.01   61
30/11/17 0.02   63
30/11/17 0.04   65
30/11/17 0.07   68

Preprocessing

df1.date = pd.to_datetime(df1.date, format='%d/%m/%y %H.%M')
df2.date = pd.to_datetime(df2.date, format='%d/%m/%y %H.%M')
df1.set_index('date', inplace=True)
df2.set_index('date', inplace=True)

df1
                     value
date
2017-11-30 00:00:00     51
2017-11-30 00:02:00     53
2017-11-30 00:05:00     65
2017-11-30 00:08:00     58

df2
                     value
date
2017-11-30 00:01:00     61
2017-11-30 00:02:00     63
2017-11-30 00:04:00     65
2017-11-30 00:07:00     68

Merge DataFrames on nearest index match

merged = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest')
merged
                         value_x  value_y
date
2017-11-30 00:00:00       51       61
2017-11-30 00:02:00       53       63
2017-11-30 00:05:00       65       65
2017-11-30 00:08:00       58       68
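If you would rather drop pairs whose timestamps are too far apart than accept the nearest match unconditionally, merge_asof also takes a tolerance. A sketch with a hypothetical 2-minute cutoff, rebuilding the example frames above:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"value": [51, 53, 65, 58]},
    index=pd.to_datetime(["2017-11-30 00:00", "2017-11-30 00:02",
                          "2017-11-30 00:05", "2017-11-30 00:08"]),
)
df2 = pd.DataFrame(
    {"value": [61, 63, 65, 68]},
    index=pd.to_datetime(["2017-11-30 00:01", "2017-11-30 00:02",
                          "2017-11-30 00:04", "2017-11-30 00:07"]),
)

# Rows of df1 with no df2 timestamp within 2 minutes would get NaN in value_y
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                       direction="nearest", tolerance=pd.Timedelta("2min"))
```

With this data every df1 row has a df2 neighbour within 2 minutes, so no NaN appears; tighten the tolerance to discard looser matches.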

Correlations

Note that df.corr() doesn't accept data as an argument, so df1.corr(df2) doesn't work. The corr method computes pairwise correlation of the columns of the DataFrame you call it on ( docs ).

merged.corr()
          value_x   value_y
value_x  1.000000  0.612873
value_y  0.612873  1.000000
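Once the two series are merged, a per-day coefficient (closer to what Problem 1 asks for) can be obtained by grouping on the calendar date. This is a sketch with hypothetical two-day data, not part of the original answer; the value_x/value_y names follow the merge above:

```python
import pandas as pd

# Hypothetical merged readings spanning two days
idx = pd.to_datetime(["2018-01-01 09:05", "2018-01-01 09:07", "2018-01-01 09:09",
                      "2018-01-02 09:05", "2018-01-02 09:07", "2018-01-02 09:08"])
merged = pd.DataFrame({"value_x": [11, 11, 10, 13, 12, 14],
                       "value_y": [67, 68, 67, 70, 69, 71]}, index=idx)

# One Pearson coefficient per calendar day
daily_r = merged.groupby(merged.index.date).apply(
    lambda g: g["value_x"].corr(g["value_y"])
)
```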

Notes

The above usage of pd.merge_asof keeps the index of df1; each row in df1 receives its closest match in df2, with replacement, so if df2 ever has fewer rows than df1, the result of merge_asof will contain duplicate values from df2. The result will have the same number of rows as df1.
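To see that with-replacement behaviour concretely, shrink the right-hand frame to two rows (a hypothetical variant of the example data):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"value": [51, 53, 65, 58]},
    index=pd.to_datetime(["2017-11-30 00:00", "2017-11-30 00:02",
                          "2017-11-30 00:05", "2017-11-30 00:08"]),
)
df2_small = pd.DataFrame(
    {"value": [61, 68]},
    index=pd.to_datetime(["2017-11-30 00:01", "2017-11-30 00:07"]),
)

# Each of the four df1 rows still gets its nearest df2_small match,
# so the two df2_small values are reused
dup = pd.merge_asof(df1, df2_small, left_index=True, right_index=True,
                    direction="nearest")
```

The result keeps all four df1 rows, with value_y = 61, 61, 68, 68.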

You mentioned that you really only care about comparing rows by relative position, e.g. comparing the 3rd value of df1 to the 3rd value of df2. In that case, instead of using merge_asof, you can simply ignore the time index once you've used it to obtain the time periods of interest, and access the underlying numpy arrays with df.values:

# Get a 2D array of shape (4, 1)
df1.values
array([[51],
       [53],
       [65],
       [58]])

# Get a 1D array of shape (4,)
df1.values.flatten()
array([51, 53, 65, 58])

# numpy correlation matrix (import numpy directly; pd.np was removed in pandas 2.0)
import numpy as np
np.corrcoef(df1.values.flatten(), df2.values.flatten())
array([[1.        , 0.61287265],
       [0.61287265, 1.        ]])
