Computing the autocorrelation of a Pandas DataFrame along each column

Question

I'd like to compute the lag-1 autocorrelation coefficient along the columns of a Pandas DataFrame. A snippet of my data is:

            RF        PC         C         D        PN        DN         P
year                                                                      
1890       NaN       NaN       NaN       NaN       NaN       NaN       NaN
1891 -0.028470 -0.052632  0.042254  0.081818 -0.045541  0.047619 -0.016974
1892 -0.249084  0.000000  0.027027  0.067227  0.099404  0.045455  0.122337
1893  0.653659  0.000000  0.000000  0.039370 -0.135624  0.043478 -0.142062

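Along year, I'd like to compute the lag-1 autocorrelation of each column (RF, PC, etc.).

To compute the autocorrelation, I extracted two time series from each column, with start and end dates offset by one year, and then computed the correlation coefficient with numpy.corrcoef.

For example, I wrote: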

numpy.corrcoef(data[['C']][1:-1],data[['C']][2:])

(The whole DataFrame is called data.)
Unfortunately, this command returns:

array([[ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       ..., 
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan],
       [ nan,  nan,  nan, ...,  nan,  nan,  nan]])

Can anyone suggest how I should compute the autocorrelation?

Answer 1

This is a late answer, but for future readers: you can also use pandas.Series.autocorr(), which computes the lag-N autocorrelation of a Series (the default lag is 1):

df['C'].autocorr(lag=1)

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.autocorr.html#pandas.Series.autocorr
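As a minimal sketch of this on the sample rows from the question (the DataFrame is rebuilt by hand here with only two of its columns, which is an assumption for illustration), autocorr skips the leading NaN row by itself and gives the same value as correlating the column with a copy of itself shifted by one row. With so few rows only two valid pairs remain, so the coefficient is trivially +1 or -1; the point is the call, not the number.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (columns RF and C only).
data = pd.DataFrame(
    {
        "RF": [np.nan, -0.028470, -0.249084, 0.653659],
        "C": [np.nan, 0.042254, 0.027027, 0.000000],
    },
    index=pd.Index([1890, 1891, 1892, 1893], name="year"),
)

# Lag-1 autocorrelation of a single column; rows with NaN are ignored.
r_c = data["C"].autocorr(lag=1)

# Equivalent spelling: correlate the column with itself shifted by one row.
r_c_manual = data["C"].corr(data["C"].shift(1))

print(r_c, r_c_manual)
```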

Answer 2

.autocorr works on a Series, not on a DataFrame. You can use .apply to run it on every column of a DataFrame:

import numpy as np
import pandas as pd

def df_autocorr(df, lag=1, axis=0):
    """Compute full-sample column-wise autocorrelation for a DataFrame."""
    return df.apply(lambda col: col.autocorr(lag), axis=axis)

d1 = pd.DataFrame(np.random.randn(100, 6))

df_autocorr(d1)
Out[32]: 
0    0.141
1   -0.028
2   -0.031
3    0.114
4   -0.121
5    0.060
dtype: float64
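A possible alternative to .apply, offered as a sketch and not as part of the original answer, is DataFrame.corrwith, which correlates each column with the same-named column of another DataFrame. Passing the shifted frame should give the same per-column values. The seeded generator and the names via_apply and via_corrwith are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Seeded random data so the comparison is reproducible.
rng = np.random.default_rng(0)
d1 = pd.DataFrame(rng.standard_normal((100, 6)))

lag = 1

# Column-wise autocorrelation via Series.autocorr, as in the answer above.
via_apply = d1.apply(lambda col: col.autocorr(lag))

# Same quantity via corrwith: each column against its own lagged copy.
via_corrwith = d1.corrwith(d1.shift(lag))

print(pd.DataFrame({"apply": via_apply, "corrwith": via_corrwith}))
```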

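You can also compute a rolling autocorrelation over a specified window, as follows (correlating a series with a shifted copy of itself is what .autocorr does under the hood):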

def df_rolling_autocorr(df, window, lag=1):
    """Compute rolling column-wise autocorrelation for a DataFrame."""

    return (df.rolling(window=window)
        .corr(df.shift(lag))) # could .dropna() here

df_rolling_autocorr(d1, window=21).dropna().head()
Out[38]: 
        0      1      2      3      4      5
21 -0.173 -0.367  0.142 -0.044 -0.080  0.012
22  0.015 -0.341  0.250 -0.036  0.023 -0.012
23  0.038 -0.329  0.279 -0.026  0.075 -0.121
24 -0.025 -0.361  0.319  0.117  0.031 -0.120
25  0.119 -0.320  0.181 -0.011  0.038 -0.111
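To make the rolling version less of a black box, the sketch below recomputes one rolling value by hand with numpy.corrcoef over the same 21-row window. The seeded series and the chosen row (50) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(100))

window, lag = 21, 1
rolling_ac = s.rolling(window=window).corr(s.shift(lag))

# Recompute the value at one row from the same 21-row window.
row = 50
current = s.iloc[row - window + 1 : row + 1].to_numpy()
lagged = s.shift(lag).iloc[row - window + 1 : row + 1].to_numpy()
by_hand = np.corrcoef(current, lagged)[0, 1]

print(rolling_ac.iloc[row], by_hand)

# The shifted copy starts with a NaN, so the first complete window ends
# at row 21, which is why the output above starts at index 21.
print(rolling_ac.first_valid_index())
```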

Answer 3

You should use:

numpy.corrcoef(df['C'][1:-1], df['C'][2:])

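df[['C']] is a DataFrame with a single column, whereas df['C'] is a Series containing the values of column C. Given the 2-D single-column DataFrame, numpy.corrcoef treats every row as its own variable with a single observation, which is why every entry comes out as nan.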

Answer 4

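Because I believe the more common use case is finding the window that corresponds to the highest correlation, I've added another function that returns that window length for each feature.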

# Find autocorrelation example.
def df_autocorr(df, lag=1, axis=0):
    """Compute full-sample column-wise autocorrelation for a DataFrame."""
    return df.apply(lambda col: col.autocorr(lag), axis=axis)

def df_rolling_autocorr(df, window, lag=1):
    """Compute rolling column-wise autocorrelation for a DataFrame."""

    return (df.rolling(window=window)
        .corr(df.shift(lag))) # could .dropna() here

def df_autocorr_highest(df, window_min, window_max, lag_f):
    """Return {"<column>_corr": [highest mean rolling autocorrelation, window length]}."""
    df_corr_dict = {}
    for i in range(len(df.columns)):
        # Starting from 0 means only positive mean correlations are ever selected.
        corr_init = 0
        corr_index = 0
        for j in range(window_min, window_max):  # window_max itself is not tried
            corr = df_rolling_autocorr(df.iloc[:, i], window=j, lag=lag_f).dropna().mean()
            if corr > corr_init:
                corr_init = corr
                corr_index = j
        corr_label = str(df.columns[i]) + "_corr"  # str() so non-string labels work
        df_corr_dict[corr_label] = [corr_init, corr_index]
    return df_corr_dict
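A usage sketch of the same search on synthetic data follows. So that the block runs by itself, the idea is restated compactly with two hypothetical helpers, rolling_autocorr and best_window. Unlike the function above, this variant takes the maximum over all windows instead of starting from 0, so it can also report a negative best value. The AR(1) and white-noise columns are assumptions chosen to give one column with clear lag-1 structure and one without.

```python
import numpy as np
import pandas as pd


def rolling_autocorr(s, window, lag=1):
    """Rolling autocorrelation of one Series."""
    return s.rolling(window=window).corr(s.shift(lag))


def best_window(df, window_min, window_max, lag=1):
    """Hypothetical helper: per column, the window with the highest
    mean rolling autocorrelation, as [mean correlation, window]."""
    best = {}
    for name in df.columns:
        means = {
            w: rolling_autocorr(df[name], w, lag).mean()  # mean() skips NaN
            for w in range(window_min, window_max)        # window_max excluded
        }
        w_best = max(means, key=means.get)
        best[f"{name}_corr"] = [means[w_best], w_best]
    return best


# Synthetic data: an AR(1) series (coefficient 0.8) and white noise.
rng = np.random.default_rng(0)
noise = rng.standard_normal(300)
ar1 = np.zeros(300)
for t in range(1, 300):
    ar1[t] = 0.8 * ar1[t - 1] + noise[t]
df = pd.DataFrame({"ar1": ar1, "white": rng.standard_normal(300)})

result = best_window(df, window_min=5, window_max=30)
print(result)
```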

4 solutions

Solution 1 19 2014-11-27 06:32:31

Solution 2 7 2017-05-02 23:14:25

Solution 3 3 accepted 2014-09-28 09:41:45

Solution 4 0 2022-11-26 15:14:22
