Pandas DataFrame groupby and regression calculations

I hope you are well.

I'm trying to run a regression on each group of a DataFrame, so far without success. I can compute the values I want, but I don't know how to merge the results back into my original DataFrame because of the structure of the output. I tried two functions.

It works for quintiles, and I include that code below.

Sorry for the length of this message; I'm trying to be as clear as I can.

Packages

import pandas as pd
from collections import OrderedDict
import statsmodels.api as sm
import numpy as np
from sklearn.linear_model import LinearRegression

Functions

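def regress(data, yvar, xvars):
    # OLS of yvar on xvars plus an intercept; returns the residuals as a Series
    Y = data[yvar]
    X = data[xvars].copy()  # copy so the new column is not written into a slice of data
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    y_pred = result.predict()
    residual = Y - y_pred
    return residual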
 
def Reg_func(x, y):
    # Cross-sectional regression of y on x; returns the residuals as a NumPy array
    x = np.array(x).reshape((-1, 1))
    y = np.array(y)
    model = LinearRegression().fit(x, y)
    y_pred = model.intercept_ + np.sum(model.coef_ * x, axis=1)
    residual = y - y_pred

    return residual

Dataframe Creation

ind = ['I1', 'I2', 'I3', 'I4', 'I5', 'I6', 'I7', 'I8', 'I9', 'I10', 'I11', 'I12', 'I13', 'I14', 'I15', 'I16', 'I17', 'I18', 'I19', 'I20']
Axe = ['A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'A', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'B', 'B', 'B']
df = pd.DataFrame(np.random.randn(20, 2), index = ind, columns=['C1', 'C2'])
df.insert(0,'Axe',Axe)

If you know a better way to create it, I would be grateful :).
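One possible way to build the same frame more compactly is to pass a dict of columns together with a generated index. This is only a sketch: the seed value and the use of `np.random.default_rng` are assumptions added for reproducibility, and the group labels are the same twenty letters as in the list above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # assumed seed, only so the example is reproducible
n = 20

df = pd.DataFrame(
    {
        # same A/B labels as the Axe list above, written as one string
        'Axe': list('AABAAAABAAABBABBBBBB'),
        'C1': rng.standard_normal(n),
        'C2': rng.standard_normal(n),
    },
    # generates 'I1' ... 'I20' instead of typing the labels by hand
    index=[f'I{i}' for i in range(1, n + 1)],
)
print(df.head())
```

Building all columns in one constructor call avoids the separate `df.insert` step.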

Calculations

# Quintile grouped by Axe
QC1 = df.groupby(['Axe'])['C1'].apply(lambda x: pd.qcut(x, 5, labels=False)+1)
print(QC1)

QC1 keeps the structure of df, so it is easy to add the result to df.
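For reference, one way to write the quintiles straight into the frame is `groupby(...).transform`, which returns a result indexed like the input. A minimal sketch on placeholder data (the small random frame and the column name `QC1` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # placeholder data, not the frame from the question
df = pd.DataFrame(
    {'Axe': ['A', 'B'] * 10, 'C1': rng.standard_normal(20)},
    index=[f'I{i}' for i in range(1, 21)],
)

# transform returns a Series with the same index as df,
# so it can be assigned as a new column directly
df['QC1'] = df.groupby('Axe')['C1'].transform(
    lambda s: pd.qcut(s, 5, labels=False) + 1
)
print(df.sort_values(['Axe', 'QC1']))
```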

# Simple regression without groupby
res_reg = Reg_func(df['C1'], df['C2'])

res_reg lines up with the rows of df.

# Regression per group with the Reg_func function
res_reg_group = (df.groupby('Axe').apply(lambda x: Reg_func(x['C1'], x['C2'])))
print(res_reg_group)

I really don't know how to put this result back into df because of its structure.

# Regression per group with regress function
res_reg_group2 = df.groupby('Axe').apply(regress, 'C1', ['C2'])
print(res_reg_group2)

res_reg_group2 seems to have a better structure (it keeps the index), but I'm not sure how to combine it with my df DataFrame. Moreover, the regress function doesn't work for a simple regression (without groupby).

Thanks for your help and take care

Without index

In your first case you can retrieve the residuals separately for each group, e.g. res_reg_group['A'].

The ordering of the residuals should be preserved (although you might want to double check), in which case you can put them into a new column based on their grouping:

res_reg_group = (df.groupby('Axe').apply(lambda x: Reg_func(x['C1'], x['C2'])))
df.loc[df['Axe']=='A', 'res'] = res_reg_group['A']
df.loc[df['Axe']=='B', 'res'] = res_reg_group['B']
print(df)

    Axe        C1        C2       res
I1    A  1.624345 -0.611756  0.545826
I2    A -0.528172 -1.072969 -0.943326
I3    B  0.865408 -2.301539 -1.889825
I4    A  1.744812 -0.761207  0.453904
I5    A  0.319039 -0.249370  0.284860
I6    A  1.462108 -2.060141 -0.980035
I7    A -0.322417 -0.384054 -0.156153
I8    B  1.133769 -1.099891 -0.656326
I9    A -0.172428 -0.877858 -0.578330
I10   A  0.042214  0.582815  0.984847
I11   A -1.100619  1.144724  1.000992
I12   B  0.901591  0.502494  0.918503
I13   B  0.900856 -0.683728 -0.267807
I14   A -0.122890 -0.935769 -0.612584
I15   B -0.267888  0.530355  0.807559
I16   B -0.691661 -0.396754 -0.169848
I17   B -0.687173 -0.845206 -0.617767
I18   B -0.671246 -0.012665  0.216664
I19   B -1.117310  0.234416  0.410802
I20   B  1.659802  0.742044  1.248044
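The two hard-coded `.loc` lines can be generalised to any number of groups by looping over the groups. The sketch below is an illustration on placeholder data, not the code from the question: `reg_residuals` is a hypothetical helper, and it uses `np.polyfit` in place of `LinearRegression` (a swapped-in technique, chosen only to avoid the sklearn dependency). Both fit an ordinary least-squares line with an intercept.

```python
import numpy as np
import pandas as pd

def reg_residuals(x, y):
    # residuals of a least-squares line y ~ intercept + slope * x (plain arrays, no index)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

rng = np.random.default_rng(0)  # placeholder data
df = pd.DataFrame(
    {'Axe': list('AABABB'), 'C1': rng.standard_normal(6), 'C2': rng.standard_normal(6)},
    index=[f'I{i}' for i in range(1, 7)],
)

df['res'] = np.nan
for label, group in df.groupby('Axe'):
    # group.index lists the rows of this group in their original order,
    # so the residual array is written back row by row
    df.loc[group.index, 'res'] = reg_residuals(group['C1'], group['C2'])

print(df)
```

Because the assignment goes through `group.index`, it does not rely on knowing the group labels in advance.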

With index

In your second case you have an index to work with, so you can just merge the two dataframes using the common index:

res_reg_group2 = df.groupby('Axe').apply(regress, 'C1', ['C2'])
output = df.merge(res_reg_group2.droplevel(0), left_index=True, right_index=True,
                  suffixes=['', '_res'])
print(output)

    Axe        C1        C2    C1_res
I1    A  1.624345 -0.611756  1.277757
I2    A -0.528172 -1.072969 -1.143578
I3    B  0.865408 -2.301539  0.403997
I4    A  1.744812 -0.761207  1.311116
I5    A  0.319039 -0.249370  0.183668
I6    A  1.462108 -2.060141  0.271328
I7    A -0.322417 -0.384054 -0.536289
I8    B  1.133769 -1.099891  0.830338
I9    A -0.172428 -0.877858 -0.674114
I10   A  0.042214  0.582815  0.391883
I11   A -1.100619  1.144724 -0.423441
I12   B  0.901591  0.502494  0.808824
I13   B  0.900856 -0.683728  0.652137
I14   A -0.122890 -0.935769 -0.658330
I15   B -0.267888  0.530355 -0.356992
I16   B -0.691661 -0.396754 -0.902651
I17   B -0.687173 -0.845206 -0.957121
I18   B -0.671246 -0.012665 -0.831740
I19   B -1.117310  0.234416 -1.245321
I20   B  1.659802  0.742044  1.598529
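`DataFrame.join` is an alternative to `merge` when the residuals carry the original row labels. In the sketch below the (group, row label) index is built explicitly with `pd.concat`, because the exact shape returned by `groupby(...).apply` may vary between pandas versions. The helper `ols_residuals`, the placeholder data, and the use of `np.polyfit` instead of statsmodels are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

def ols_residuals(group, ycol, xcol):
    # residuals of ycol regressed on xcol with an intercept; keeps the group's index
    slope, intercept = np.polyfit(group[xcol], group[ycol], 1)
    return group[ycol] - (intercept + slope * group[xcol])

rng = np.random.default_rng(0)  # placeholder data
df = pd.DataFrame(
    {'Axe': list('AABABB'), 'C1': rng.standard_normal(6), 'C2': rng.standard_normal(6)},
    index=[f'I{i}' for i in range(1, 7)],
)

# concatenating a dict adds the group label as an outer index level: (Axe, row label)
res = pd.concat(
    {label: ols_residuals(g, 'C1', 'C2') for label, g in df.groupby('Axe')}
)

# drop the group level so the index matches df, name the column, then join on the index
output = df.join(res.droplevel(0).rename('C1_res'))
print(output)
```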

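The residual values differ because the two calls regress in opposite directions, not because of a difference between statsmodels and sklearn: Reg_func(x['C1'], x['C2']) regresses C2 on C1, while regress(..., 'C1', ['C2']) regresses C1 on C2. Either way, that's how you combine the results.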
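A quick way to check where each set of residuals comes from is to fit both directions on the same data. Reg_func(x['C1'], x['C2']) fits C2 on C1, whereas regress(group, 'C1', ['C2']) fits C1 on C2. The sketch below uses only NumPy: `np.polyfit` and `np.linalg.lstsq` stand in for the two libraries (an assumption made to keep it dependency-free), and the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)  # placeholder data
c1 = rng.standard_normal(10)
c2 = rng.standard_normal(10)

def residuals_polyfit(x, y):
    # least-squares line y ~ intercept + slope * x via np.polyfit
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

def residuals_lstsq(x, y):
    # the same model solved through an explicit design matrix with a constant column
    design = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

res_c2_on_c1 = residuals_polyfit(c1, c2)  # direction used by Reg_func(C1, C2)
res_c1_on_c2 = residuals_polyfit(c2, c1)  # direction used by regress(data, 'C1', ['C2'])

# two solvers, same direction: residuals agree
same_direction_agrees = np.allclose(res_c2_on_c1, residuals_lstsq(c1, c2))
# same solver, opposite directions: residuals differ
directions_differ = not np.allclose(res_c2_on_c1, res_c1_on_c2)
print(same_direction_agrees, directions_differ)
```

Calling both functions with the same dependent and independent variable should therefore bring the two residual columns into agreement, up to floating-point error.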
