Rolling linear regression on a large DataFrame

Question

I have two huge dataframes, df_y and df_x.
df_y has columns ['date','ids','Y']. Basically, each of the 'ids' has data for all the 'date' values.
df_x has columns ['date','X1','X2','X3','X4','X5','X6'].
df_x has all the dates that are in df_y. However, some ids might cover a shorter period, i.e., either starting from a later date or ending at an earlier date.
I want to run a rolling linear regression (OLS) Y ~ X1 + X2 + X3 + X4 + X5 + X6 + intercept for each of the 'ids' in df_y, with a lookback of 200 days.

Sample dataframes:

import string, random, pandas as pd, numpy as np
ids = [''.join(random.choice(string.ascii_uppercase) for _ in range(3)) for _ in range(200)]
dates = pd.date_range('2000-01-01', '2017-07-02')
df_dates = pd.DataFrame({'date':dates, 'joinC':len(dates)*[2]})
df_ids = pd.DataFrame({'ids':ids, 'joinC':len(ids)*[2]})
df_values = pd.DataFrame({'Y':np.random.normal(size = 
len(dates)*len(ids))})
df_y = df_dates.merge(df_ids, on='joinC', how="outer")
df_y = df_y[['date', 'ids']].merge(df_values, left_index=True, 
right_index=True, how="inner")
df_y = df_y.sort_values(['date', 'ids'], ascending=[True, True])
df_x = pd.DataFrame({'date':dates, 'X1':np.random.normal(size = len(dates)), 'X2':np.random.normal(size = len(dates)), 'X3':np.random.normal(size = len(dates)), 'X4':np.random.normal(size = len(dates)), 'X5':np.random.normal(size = len(dates)), 'X6':np.random.normal(size = len(dates))})

My attempt:

import statsmodels.api as sm
dates = list(df_y['date'].unique())
ids = list(df_y['ids'].unique())
for i in range(200, len(dates) +1):
  for id in ids:
    s_date = dates[i - 200]
    e_date = dates[i - 1]
    Y = df_y[(df_y['date'] >= s_date) & (df_y['date'] <= e_date) & (df_y['ids'] == id)]['Y']
    Y = Y.reset_index()['Y']
    X = df_x[(df_x['date'] >= s_date) & (df_x['date'] <= e_date)]
    X = X.reset_index()[['X1','X2','X3','X4','X5','X6']]
    X = sm.add_constant(X)
    if len(X) != len(Y):  # skip ids without a full 200-day window
      continue
    regr = sm.OLS(Y, X).fit()  #Hangs here after 2 years.
    X_pr = X.tail(1)
    Y_hat = regr.predict(X_pr)
    df_y.loc[(df_y['date'] == e_date) & (df_y['ids'] == id), 'Y_hat'] = Y_hat.tolist()[0]

My attempt above seems to work fine up until the point where it hangs (most likely at the fitting step) after working through approx. 2 years of dates. I am inclined to use statsmodels since it supports regularization (planned for future work). However, if another library makes it faster or more elegant, I am fine with that too. Could someone please help define the fastest solution that doesn't hang midway? Thanks a lot.

Answer 1

I was able to get this workaround using Pandas MovingOLS:

import pandas as pd
dates = list(df_y['date'].unique())
ids = list(df_y['ids'].unique())
Y_hats = []
for id in ids:
    Y = df_y[(df_y['ids'] == id)][['date', 'ids', 'Y']]
    Y = Y.merge(df_x, how='left', on=['date'])
    X_cols = [c for c in df_x.columns if c != 'date']  # regressors only
    model = pd.stats.ols.MovingOLS(y=Y['Y'], x=Y[X_cols], window_type='rolling', window=250, intercept=True)
    Y['intercept'] = 1
    betas = model.beta
    betas = betas.multiply(Y[betas.columns], axis='index')
    betas = betas.sum(axis=1)
    betas = betas[betas > 0]
    betas = betas.to_frame()
    betas.columns = ['Y_hat']
    betas = betas.merge(Y[['date', 'ids']], how='left', left_index=True, right_index=True)
    Y_hats.append(betas)
Y_hats = pd.concat(Y_hats)
# Attach the predictions for all ids back onto the full frame.
df_y = df_y.merge(Y_hats[['date', 'ids', 'Y_hat']], how='left', on=['date', 'ids'])
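
Note that pd.stats.ols.MovingOLS was removed in pandas 0.20, so the code above only runs on older versions. As a rough alternative, here is a minimal NumPy sketch (not the method used above) that exploits the fact that every id shares the same regressors: it solves the normal equations for all windows and all ids in one batched call. It assumes Y has already been reshaped to one column per id, for example with df_y.pivot(index='date', columns='ids', values='Y') when the ids are unique, that X already contains an intercept column, and that there are no missing values; ids with a shorter history would need separate handling. The function names are made up for illustration, and sliding_window_view needs NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_ols_all_ids(X, Y, window):
    """Rolling OLS betas for every column of Y against shared regressors X.

    X: (n_dates, k) regressors, intercept column included by the caller.
    Y: (n_dates, n_ids) responses, one column per id, no NaNs (assumption).
    Returns an array of shape (n_dates - window + 1, k, n_ids); entry i is
    fitted on rows i .. i + window - 1.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Windowed views: (n_windows, k, window) and (n_windows, n_ids, window).
    Xw = sliding_window_view(X, window, axis=0)
    Yw = sliding_window_view(Y, window, axis=0)
    # Normal equations per window: (X'X) beta = X'Y.
    XtX = np.einsum('nkw,nlw->nkl', Xw, Xw)
    XtY = np.einsum('nkw,nmw->nkm', Xw, Yw)
    return np.linalg.solve(XtX, XtY)


def fitted_at_window_end(X, betas, window):
    """In-sample fitted value for the last row of each window."""
    X = np.asarray(X, dtype=float)
    return np.einsum('nk,nkm->nm', X[window - 1:], betas)


# Small synthetic example: 6 regressors plus intercept, 4 ids, 200-day window.
rng = np.random.default_rng(0)
n_dates, n_ids, window = 500, 4, 200
X = np.column_stack([np.ones(n_dates), rng.normal(size=(n_dates, 6))])
Y = rng.normal(size=(n_dates, n_ids))
betas = rolling_ols_all_ids(X, Y, window)
Y_hat = fitted_at_window_end(X, betas, window)
print(betas.shape, Y_hat.shape)
```

Solving the normal equations is fast but less numerically robust than a QR or SVD based least-squares solve, so this is only reasonable when the regressors are not close to collinear.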

There is also a straightforward way to do this, Y['Y_hat'] = model.y_predict, if, say, one wants to fit Y ~ X on (y_1, y_2, ... y_n) and (x_1, x_2, ... x_n) but only wants to predict Y_(n+1) using X_(n+1).
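
To make the "fit on rows 1..n, predict row n+1" idea concrete without MovingOLS, here is a small sketch in plain pandas and NumPy. It is an illustration under assumptions, not a reimplementation of y_predict: the helper names rolling_betas and predict_next are hypothetical, X is assumed to already carry an intercept column, and the per-window loop is kept simple rather than fast.

```python
import numpy as np
import pandas as pd


def rolling_betas(y, X, window):
    """Betas per trailing window; row t is fitted on rows t-window+1 .. t."""
    Xv = X.to_numpy(dtype=float)
    yv = y.to_numpy(dtype=float)
    out = np.full(Xv.shape, np.nan)  # rows without a full window stay NaN
    for end in range(window, len(Xv) + 1):
        rows = slice(end - window, end)
        out[end - 1] = np.linalg.lstsq(Xv[rows], yv[rows], rcond=None)[0]
    return pd.DataFrame(out, index=X.index, columns=X.columns)


def predict_next(betas, X):
    """Apply the betas from the window ending at row t-1 to X at row t."""
    lagged = betas.shift(1)
    # min_count keeps rows without a full set of betas as NaN instead of 0.
    return (lagged * X[betas.columns]).sum(axis=1, min_count=betas.shape[1])


# Synthetic example for a single id.
rng = np.random.default_rng(0)
idx = pd.date_range('2000-01-01', periods=300)
X = pd.DataFrame(rng.normal(size=(300, 2)), index=idx, columns=['X1', 'X2'])
X.insert(0, 'intercept', 1.0)
y = pd.Series(rng.normal(size=300), index=idx)

betas = rolling_betas(y, X, window=200)
y_hat_next = predict_next(betas, X)  # NaN for the first 200 rows
```

Using min_count in the row-wise sum leaves rows without a complete window as NaN, so there is no need to filter the predictions by sign or value afterwards.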

Solution 1
0 2018-02-18 16:22:08