Cumulative OLS with Python Pandas

Question

I am using Pandas 0.8.1, and at the moment I can't change the version. If a newer version would help with the problem below, please note it in a comment rather than an answer. Also, this is for a research replication project, so even though re-running a regression after appending only one new data point might be silly (if the data set is large), I still have to do it. Thanks!

In Pandas, there is a rolling option for the window_type argument to pandas.ols, but it seems implicit that this requires either choosing a window size or defaulting to the whole data sample. I'm looking instead to use all the data in a cumulative fashion.

I am trying to run a regression on a pandas.DataFrame that is sorted by date. For each index i, I want to run a regression using the data available from the minimum date up through the date at index i. So the window effectively grows by one on every iteration, all data is used cumulatively from the earliest observation, and no data is ever dropped out of the window.

I have written a function (below) that works with apply to perform this, but it is unacceptably slow. Is there instead a way to use pandas.ols to perform this sort of cumulative regression directly?

Here are some more specifics about my data. I have a pandas.DataFrame containing a column of identifiers, a column of dates, a column of left-hand-side values, and a column of right-hand-side values. I want to use groupby to group on the identifier, and then, for every time period, perform a cumulative regression of the left-hand-side variable on the right-hand-side variable.

Here is the function I am able to use with apply on the identifier-grouped object:

import numpy as np
import pandas

def cumulative_ols(
                   data_frame,
                   lhs_column,
                   rhs_column,
                   date_column,
                   min_obs=60
                  ):

    beta_dict = {}
    for dt in data_frame[date_column].unique():
        # All rows from the earliest date up through the current date.
        cur_df = data_frame[data_frame[date_column] <= dt]
        obs_count = cur_df[lhs_column].notnull().sum()

        if min_obs <= obs_count:
            # Re-fit the regression from scratch on the growing window.
            beta = pandas.ols(
                              y=cur_df[lhs_column],
                              x=cur_df[rhs_column],
                             ).beta.ix['x']
            ###
        else:
            beta = np.nan
        ###
        beta_dict[dt] = beta
    ###

    beta_df = pandas.DataFrame(pandas.Series(beta_dict, name="FactorBeta"))
    beta_df.index.name = date_column
    return beta_df

Answer 1

Following the advice in the comments, I wrote my own function that can be used with apply. It relies on cumsum to accumulate the individual terms needed to express the coefficient of a univariate OLS regression in vectorized form.

import numpy as np

def cumulative_ols(
                   data_frame,
                   lhs_column,
                   rhs_column,
                   date_column,
                   min_obs=60,
                  ):
    """
    Function to perform a cumulative OLS on a Pandas data frame. It is
    meant to be used with `apply` after grouping the data frame by categories
    and sorting by date, so that the regression below applies to the time
    series of a single category's data and the use of `cumsum` will work    
    appropriately given sorted dates. It is also assumed that the date 
    conventions of the left-hand-side and right-hand-side variables have been 
    arranged by the user to match up with any lagging conventions needed.

    This OLS is implicitly univariate and relies on the simplification to the
    formula:

    Cov(x,y) ~ (1/n)*sum(x*y) - (1/n)*sum(x)*(1/n)*sum(y)
    Var(x)   ~ (1/n)*sum(x^2) - ((1/n)*sum(x))^2
    beta     ~ Cov(x,y) / Var(x)

    and the code makes a further simplification by cancelling one factor
    of (1/n).

    Notes: one easy improvement is to change the date column to a generic sort
    column since there's no special reason the regressions need to be time-
    series specific.
    """
    data_frame["xy"]         = (data_frame[lhs_column] * data_frame[rhs_column]).fillna(0.0)
    data_frame["x2"]         = (data_frame[rhs_column]**2).fillna(0.0)
    data_frame["yobs"]       = data_frame[lhs_column].notnull().map(int)
    data_frame["xobs"]       = data_frame[rhs_column].notnull().map(int)
    data_frame["cum_yobs"]   = data_frame["yobs"].cumsum()
    data_frame["cum_xobs"]   = data_frame["xobs"].cumsum()
    data_frame["cumsum_xy"]  = data_frame["xy"].cumsum()
    data_frame["cumsum_x2"]  = data_frame["x2"].cumsum()
    data_frame["cumsum_x"]   = data_frame[rhs_column].fillna(0.0).cumsum()
    data_frame["cumsum_y"]   = data_frame[lhs_column].fillna(0.0).cumsum()
    data_frame["cum_cov"]    = data_frame["cumsum_xy"] - (1.0/data_frame["cum_yobs"])*data_frame["cumsum_x"]*data_frame["cumsum_y"]
    data_frame["cum_x_var"]  = data_frame["cumsum_x2"] - (1.0/data_frame["cum_xobs"])*(data_frame["cumsum_x"])**2
    data_frame["FactorBeta"] = data_frame["cum_cov"]/data_frame["cum_x_var"]
    data_frame["FactorBeta"][data_frame["cum_yobs"] < min_obs] = np.NaN
    return data_frame[[date_column, "FactorBeta"]].set_index(date_column)
### End cumulative_ols
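
To make the cancellation mentioned in the docstring explicit, here is a short derivation. The notation is mine, not the original poster's: n_t is the number of observations up to date t, and S_x, S_y, S_xy, S_xx are the running sums of x, y, x*y and x^2 that the code builds with cumsum. It assumes the same n_t applies to every sum, that is, x and y are observed on the same rows.

```latex
\hat{\beta}_t
  = \frac{\widehat{\mathrm{Cov}}_t(x, y)}{\widehat{\mathrm{Var}}_t(x)}
  = \frac{\frac{1}{n_t} S_{xy} - \frac{1}{n_t^{2}} S_x S_y}
         {\frac{1}{n_t} S_{xx} - \frac{1}{n_t^{2}} S_x^{2}}
  = \frac{S_{xy} - \frac{1}{n_t} S_x S_y}
         {S_{xx} - \frac{1}{n_t} S_x^{2}}
```

The last form corresponds to the cum_cov and cum_x_var columns: one factor of 1/n_t has been multiplied out of both numerator and denominator, so it never needs to be computed.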

I have verified on numerous test cases that this matches the output of my former function and the output of NumPy's linalg.lstsq function. I haven't done a full benchmark of the timing, but anecdotally it is around 50 times faster in the cases I've been working on.
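
For readers who want to reproduce that kind of check, here is a minimal NumPy-only sketch. It is not the code from the post: the names cumulative_beta and lstsq_beta are hypothetical, the data are synthetic, and it assumes a NumPy version that accepts rcond=None in np.linalg.lstsq. Unlike the function above, it counts only rows where both x and y are present, which is an assumption of this sketch and not something the original states.

```python
import numpy as np

def cumulative_beta(x, y, min_obs=2):
    """Expanding-window OLS slope of y on x (with intercept), via running sums."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption of this sketch: use only rows where both x and y are observed.
    ok = ~(np.isnan(x) | np.isnan(y))
    xz = np.where(ok, x, 0.0)
    yz = np.where(ok, y, 0.0)
    n = np.cumsum(ok).astype(float)           # observations seen so far
    sx, sy = np.cumsum(xz), np.cumsum(yz)     # running sums of x and y
    sxy, sxx = np.cumsum(xz * yz), np.cumsum(xz * xz)
    with np.errstate(divide="ignore", invalid="ignore"):
        cov = sxy - sx * sy / n               # n * Cov(x, y)
        var = sxx - sx * sx / n               # n * Var(x)
        beta = cov / var
    beta[n < min_obs] = np.nan                # too few observations so far
    return beta

def lstsq_beta(x, y):
    """Reference slope from np.linalg.lstsq on the complete rows."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ok = ~(np.isnan(x) | np.isnan(y))
    design = np.column_stack([x[ok], np.ones(ok.sum())])  # slope, intercept
    coef = np.linalg.lstsq(design, y[ok], rcond=None)[0]
    return coef[0]

# Synthetic data with a known slope of 1.5 and a few missing values.
rng = np.random.RandomState(0)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(scale=0.1, size=200)
x[[5, 17]] = np.nan
y[[17, 40]] = np.nan

fast = cumulative_beta(x, y, min_obs=10)
# Brute force: re-fit on every prefix where the fast version produced a value.
slow = np.array([
    lstsq_beta(x[: i + 1], y[: i + 1]) if np.isfinite(fast[i]) else np.nan
    for i in range(len(x))
])
print(np.allclose(fast, slow, equal_nan=True))
```

The brute-force loop re-solves a least-squares problem for every prefix, while the running-sum version makes a single pass over the data. A speedup of the kind reported above is therefore plausible, though this sketch does not measure it.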

Answer 1: 0, accepted, 2013-02-27 13:43:45
