简体   繁体   中英

Pandas Regression by group of two columns

What I'm going to do

I'd like to get average stock price, regression coefficient and R-square of stock prices in float by stock item, eg Apple, Amazon, etc., and certain date period, eg Feb. 15 ~ Mar.14. as a part of quantitative investment simulation encompassing 30 years. The problem is that it simply is too slow. At first, I made the whole code with PostgreSQL but it was too slow - didn't finish after 2 hours. After asking a professor friend in management information system, I'm trying pandas for the first time.


The data structure implemented so far look like this:

  1. Raw data (Dataframe named dfStock)
    ──────────────────────────────────────────
    Code | Date | Date Group | Price |
    ──────────────────────────────────────────
    AAPL | 20200205 | 20200205 | ###.## |
    AAPL | 20200206 | 20200305 | ###.## |
    ...
    AAPL | 20200305 | 20200305 | ###.## |
    AAPL | 20200306 | 20200405 | ###.## |
    ...
    ──────────────────────────────────────────

  2. Results (Dataframe named dfSumS)
    ──────────────────────────────────────────
    Code | Date group | Avg.Price | Slope | R-Square
    ──────────────────────────────────────────
    AAPL | 20200205 | ###.## | #.## | #.##
    AMZN | 20200205 | ###.## | #.## | #.##
    ...
    AAPL | 20200305 | ###.## | #.## | #.##
    AMZN | 20200305 | ###.## | #.## | #.##
    ...
    ──────────────────────────────────────────


Code As of Now

'prevdt' corresponds to 'Date Group' in the above and 'compcd' means company code

from sklearn.linear_model import LinearRegression

# Method Tried 1    
model = LinearRegression()   
def getRegrS(arg_cd, arg_prevdt):
    x = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]['rnk'].to_numpy().reshape((-1,1))
    y = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]['adjenp'].to_numpy()
    model.fit(x, y)
    return model.coef_[0], model.score(x,y)

# Method Tried 2
def getRegrS(arg_cd, arg_prevdt):
    x = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]['rnk'].to_numpy()
    y = dfStock[(dfStock['compcd']==arg_cd) & (dfStock['prevdt']==arg_prevdt)]['adjenp'].to_numpy()
    rv = stats.linregress(x,y)
    return rv[0], rv[2]
    
dfSumS['rnk'] = dfStock.groupby(['compcd','prevdt']).cumcount()+1
dfSumS[['slope','rsq']]= [getRegrS(cd, prevdt) for cd, prevdt in zip(dfSumS['compcd'], dfSumS['prevdt'])]

What I've tried before

Based on recommendation in this link , I tried vectoriztion, but got the message "Can only compare identically-labeled Series objects". Unable to solve this problem, I came to two functions in the above, which were not fast enough. Both worked with a smaller set of code like the year of 2020, but once the data period became as large as 2~3 decades, it took hours.

I thought of apply, iterrows, etc., but didn't because firstly the link says it's slower than I've done and secondly each of these seem to apply to only one column while I have to two results - coefficient and R-square over the same period so that calling them twice definitely will be slower.

Now I'm trying the pool thing mentioned in a few posts

I'm afraid that if you're trying to run thousands of large linear regressions, then you will have to pay the price in time spent running. If you are only interested in the beta coefficient or the r2 value, it could be more efficient to calculate them separately with numpy as (XtX)^(-1)Xty and cov(X,y)/sqrt(var(X)var(y)) respectively.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM