Cross-correlation (time-lag-correlation) with pandas?

Question

I have various time series, that I want to correlate - or rather, cross-correlate - with each other, to find out at which time lag the correlation factor is the greatest.

I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.

Edit

The issue I am having with all the numpy/scipy methods, is that they seem to lack awareness of the timeseries nature of my data. When I correlate a time series that starts in say 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces a 1020 entries (length of the longer series) array full of nan.

The various Q's on this subject indicate that there should be a way to solve the different length issue, but so far, I have seen no indication on how to use it for specific time periods. I just need to shift by 12 months in increments of 1, for seeing the time of maximum correlation within one year.

Edit2

Some minimal sample data:

import pandas as pd
import numpy as np
dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq = 'MS')
dfdata1 = (np.random.random_integers(-30,30,(len(dfdates1)))/10.0) #My real data is from measurements, but random between -3 and 3 is fitting
df1 = pd.DataFrame(dfdata1, index = dfdates1)
dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq = 'MS')
dfdata2 = (np.random.random_integers(-30,30,(len(dfdates2)))/10.0)
df2 = pd.DataFrame(dfdata2, index = dfdates2)

Due to various processing steps, those dfs end up changed into df that are indexed from 1940 to 2015. this should reproduce this:

bigdates = pd.date_range('01/01/1940', '01/01/2015', freq = 'MS')
big1 = pd.DataFrame(index = bigdates)
big2 = pd.DataFrame(index = bigdates)
big1 = pd.concat([big1, df1],axis = 1)
big2 = pd.concat([big2, df2],axis = 1)

This is what I get when I correlate with pandas and shift one dataset:

In [451]: corr_coeff_0 = big1[0].corr(big2[0])
In [452]: corr_coeff_0
Out[452]: 0.030543266378853299
In [453]: big2_shift = big2.shift(1)
In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])
In [455]: corr_coeff_1
Out[455]: 0.020788314779320523

And trying scipy:

In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")
In [457]: scicorr
Out[457]: 
array([[ nan],
       [ nan],
       [ nan],
       ..., 
       [ nan],
       [ nan],
       [ nan]])

which according to whos is

scicorr               ndarray                       1801x1: 1801 elems, type `float64`, 14408 bytes

But I'd just like to have 12 entries. /Edit2

The idea I have come up with, is to implement a time-lag-correlation myself, like so:

corr_coeff_0 = df1['Data'].corr(df2['Data'])
df1_1month = df1.shift(1)
corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])
df1_6month = df1.shift(6)
corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
...and so on

But this is probably slow, and I am probably trying to reinvent the wheel here. Edit The above approach seems to work, and I have put it into a loop, to go through all 12 months of a year, but I still would prefer a built in method.

Answer 1

As far as I can tell, there isn't a built in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr , you can see you've got the right idea:

def autocorr(self, lag=1):
    """
    Lag-N autocorrelation

    Parameters
    ----------
    lag : int, default 1
        Number of lags to apply before performing autocorrelation.

    Returns
    -------
    autocorr : float
    """
    return self.corr(self.shift(lag))

So a simple timelagged cross covariance function would be

def crosscorr(datax, datay, lag=0):
    """ Lag-N cross correlation. 
    Parameters
    ----------
    lag : int, default 0
    datax, datay : pandas.Series objects of equal length

    Returns
    ----------
    crosscorr : float
    """
    return datax.corr(datay.shift(lag))

Then if you wanted to look at the cross correlations at each month, you could do

 xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]

Answer 2

There is a better approach : You can create a function that shifted your dataframe first before calling the corr().

Get this dataframe like an example:

d = {'prcp': [0.1,0.2,0.3,0.0], 'stp': [0.0,0.1,0.2,0.3]}
df = pd.DataFrame(data=d)

>>> df
   prcp  stp
0   0.1  0.0
1   0.2  0.1
2   0.3  0.2
3   0.0  0.3

Your function to shift others columns (except the target):

def df_shifted(df, target=None, lag=0):
    if not lag and not target:
        return df       
    new = {}
    for c in df.columns:
        if c == target:
            new[c] = df[target]
        else:
            new[c] = df[c].shift(periods=lag)
    return  pd.DataFrame(data=new)

Supposing that your target is comparing the prcp (precipitation variable) with stp(atmospheric pressure)

If you do at the present will be:

>>> df.corr()
      prcp  stp
prcp   1.0 -0.2
stp   -0.2  1.0

But if you shifted 1(one) period all other columns and keep the target (prcp):

df_new = df_shifted(df, 'prcp', lag=-1)

>>> print df_new
   prcp  stp
0   0.1  0.1
1   0.2  0.2
2   0.3  0.3
3   0.0  NaN

Note that now the column stp is shift one up position at period, so if you call the corr(), will be:

>>> df_new.corr()
      prcp  stp
prcp   1.0  1.0
stp    1.0  1.0

So, you can do with lag -1, -2, -n!!

Answer 3

To build up on Andre's answer - if you only care about (lagged) correlation to the target, but want to test various lags (eg to see which lag gives the highest correlations), you can do something like this:

lagged_correlation = pd.DataFrame.from_dict(
    {x: [df[target].corr(df[x].shift(-t)) for t in range(max_lag)] for x in df.columns})

This way, each row corresponds to a different lag value, and each column corresponds to a different variable (one of them is the target itself, giving the autocorrelation).

Cross-correlation (time-lag-correlation) with pandas?

Question

3 answers

solution1
33 2016-05-13 17:17:51

solution2
2 2018-02-13 16:56:26

solution3
0 2019-04-03 08:42:12

Cross-correlation (time-lag-correlation) with pandas?

Question

3 answers

solution1 33 2016-05-13 17:17:51

solution2 2 2018-02-13 16:56:26

solution3 0 2019-04-03 08:42:12

solution1
33 2016-05-13 17:17:51

solution2
2 2018-02-13 16:56:26

solution3
0 2019-04-03 08:42:12