Pandas regression grouped by two columns

Question

What I'm going to do

I'd like to get the average stock price, regression coefficient, and R-square of stock prices (floats) by stock item (e.g., Apple, Amazon) and by date period (e.g., Feb. 15 to Mar. 14), as part of a quantitative investment simulation covering 30 years. The problem is that it is simply too slow. At first I wrote the whole thing in PostgreSQL, but it was too slow: it hadn't finished after 2 hours. After asking a professor friend in management information systems, I'm trying pandas for the first time.

The data structures implemented so far look like this:

Raw data (DataFrame named dfStock)
──────────────────────────────────────────
Code | Date     | Date Group | Price
──────────────────────────────────────────
AAPL | 20200205 | 20200205   | ###.##
AAPL | 20200206 | 20200305   | ###.##
...
AAPL | 20200305 | 20200305   | ###.##
AAPL | 20200306 | 20200405   | ###.##
...
──────────────────────────────────────────
Results (DataFrame named dfSumS)
──────────────────────────────────────────────────
Code | Date Group | Avg. Price | Slope | R-Square
──────────────────────────────────────────────────
AAPL | 20200205   | ###.##     | #.##  | #.##
AMZN | 20200205   | ###.##     | #.##  | #.##
...
AAPL | 20200305   | ###.##     | #.##  | #.##
AMZN | 20200305   | ###.##     | #.##  | #.##
...
──────────────────────────────────────────────────

Code as of now

'prevdt' corresponds to 'Date Group' above, and 'compcd' is the company code.

from scipy import stats
from sklearn.linear_model import LinearRegression

# Method tried 1: scikit-learn LinearRegression
model = LinearRegression()
def getRegrS(arg_cd, arg_prevdt):
    # Filter the rows for this (company, date group) once and reuse them for x and y
    grp = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]
    x = grp['rnk'].to_numpy().reshape((-1,1))
    y = grp['adjenp'].to_numpy()
    model.fit(x, y)
    return model.coef_[0], model.score(x,y)  # slope, R-square

# Method tried 2: scipy.stats.linregress (overrides the definition above)
def getRegrS(arg_cd, arg_prevdt):
    # Filter the rows for this (company, date group) once and reuse them for x and y
    grp = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]
    x = grp['rnk'].to_numpy()
    y = grp['adjenp'].to_numpy()
    rv = stats.linregress(x,y)
    return rv[0], rv[2]**2  # slope, R-square (rv[2] is the correlation r)
    
# Within-group rank (1, 2, ...) used as the regressor; getRegrS reads it from dfStock
dfStock['rnk'] = dfStock.groupby(['compcd','prevdt']).cumcount()+1
dfSumS[['slope','rsq']]= [getRegrS(cd, prevdt) for cd, prevdt in zip(dfSumS['compcd'], dfSumS['prevdt'])]

What I've tried before

Based on the recommendation in this link, I tried vectorization, but got the message "Can only compare identically-labeled Series objects". Unable to solve that, I ended up with the two functions above, neither of which is fast enough. Both worked on a smaller data set, such as the year 2020 alone, but once the data period grew to two or three decades, they took hours.

I thought of apply, iterrows, etc., but didn't use them: first, the link says they are slower than what I've done; second, each of them seems to produce only one column, while I need two results (coefficient and R-square) over the same period, so calling them twice would definitely be slower.

Now I'm trying the pool approach mentioned in a few posts.

Answer 1

I'm afraid that if you're running thousands of large linear regressions, you will have to pay the price in running time. If you are only interested in the beta coefficient or the r2 value, it could be more efficient to calculate them directly with numpy, as (XtX)^(-1)Xty and cov(X,y)/sqrt(var(X)var(y)) respectively. Note that the second expression is the correlation r; for a regression on a single variable, squaring it gives r2.
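
As a rough illustration of that idea, here is a minimal sketch that computes the closed-form slope and r2 for every group at once with pandas groupby sums, rather than fitting one model per group. It assumes the column names from the question ('compcd', 'prevdt', 'adjenp'), that 'adjenp' holds the price, and that rows are already in date order within each group so the running count 1, 2, ... can serve as x. The function name group_slope_rsq, the output column avg_price, and the toy numbers are made up for the example.

```python
import pandas as pd


def group_slope_rsq(df, keys=("compcd", "prevdt"), ycol="adjenp"):
    """Per-group mean, OLS slope and R^2 of ycol against the in-group rank 1..n.

    Hypothetical helper: assumes rows are already ordered by date within
    each group, as the 'rnk' column in the question does.
    """
    keys = list(keys)
    d = df[keys].copy()
    d["y"] = df[ycol].astype(float)
    d["x"] = d.groupby(keys).cumcount() + 1.0  # regressor: 1, 2, ..., n

    # Center x and y on their group means (better conditioned than raw sums).
    means = d.groupby(keys)[["x", "y"]].transform("mean")
    xc = d["x"] - means["x"]
    yc = d["y"] - means["y"]
    d["sxy"] = xc * yc
    d["sxx"] = xc * xc
    d["syy"] = yc * yc

    # One pass of sums per group replaces one regression fit per group.
    agg = d.groupby(keys).agg(
        avg_price=("y", "mean"),
        sxy=("sxy", "sum"),
        sxx=("sxx", "sum"),
        syy=("syy", "sum"),
    )
    agg["slope"] = agg["sxy"] / agg["sxx"]                     # cov(x, y) / var(x)
    agg["rsq"] = agg["sxy"] ** 2 / (agg["sxx"] * agg["syy"])   # squared correlation
    return agg[["avg_price", "slope", "rsq"]].reset_index()


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "compcd": ["AAPL"] * 4 + ["AMZN"] * 4,
    "prevdt": [20200205] * 8,
    "adjenp": [1.0, 2.0, 4.0, 3.0, 10.0, 8.0, 9.0, 5.0],
})
result = group_slope_rsq(toy)
print(result)
```

Groups with a single row or a constant price give a zero denominator, so their slope or r2 comes out as NaN and may need separate handling.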

Answer 1
0 2020-10-04 12:54:29
