
Python - Multiple Linear Regression - Coefficient of Determination for each Input Variable

I am performing a fairly straightforward multiple linear regression in Python using sklearn. See the code snippet below; full_results is a DataFrame in which all variables are numeric.

The result of this code is a single coefficient of determination, which I believe denotes how much of the change in y is due to the combination of x1 - x4.

My question is whether the coefficient of determination can be split out among the four input variables, so I can see how much of the change in y is attributed to each variable individually.

I can of course run a single variable linear regression for each variable independently, but this doesn't feel like the right solution.

I have a memory of being in stats class many years ago and doing something similar in R.

from sklearn.linear_model import LinearRegression

x = full_results[['x1','x2','x3','x4']].values
y = full_results['y'].values

mlr = LinearRegression()
mlr.fit(x, y)

mlr.score(x, y)

The coefficient of determination is the proportion of total variance explained, so another way of looking at it is to look at the proportion of variance explained by each term. For this we use an ANOVA to calculate the sum of squares for each term.

One thing to take note of is that this works cleanly only if your predictors are not correlated. If they are, then the order in which they are specified in the model makes a difference in the calculation.
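To see concretely why the order matters, here is a minimal sketch (not part of the original answer) with two deliberately correlated predictors; the data and names (df_corr, x1, x2) are made up for illustration. The sequential (Type I) sums of squares reported by anova_lm shift depending on which term enters the formula first:

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 is correlated with x1
y = 3 * x1 + 2 * x2 + rng.normal(size=n)
df_corr = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# Same data, two different term orders in the formula
ss_a = anova_lm(ols('y ~ x1 + x2', df_corr).fit())['sum_sq']
ss_b = anova_lm(ols('y ~ x2 + x1', df_corr).fit())['sum_sq']

print(ss_a)  # x1 entered first: x1 absorbs the shared variance
print(ss_b)  # x2 entered first: the split between x1 and x2 changes

With uncorrelated predictors, as in the make_regression example below, the two orderings give essentially the same split, which is why the decomposition is unambiguous in that case.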

Using an example dataset:

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import pandas as pd

X, y = make_regression(n_samples=100, n_features=4,
                       n_informative=3, noise=20, random_state=99)

df = pd.DataFrame(X, columns=['x1', 'x2', 'x3', 'x4'])
df['y'] = y

mlr = LinearRegression()
mlr.fit(df[['x1','x2','x3','x4']], y)

mlr.coef_
array([ 8.33369861, 29.1717497 , 26.6294007 , -1.82445836])

mlr.score(df[['x1','x2','x3','x4']], y)

0.8465893941639528

It's easier to calculate this with statsmodels. Making the same linear fit, you can see the coefficients are pretty much the same:

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

lm = ols('y ~ x1 + x2 + x3 + x4',df).fit()
lm.params

Intercept    -0.740399
x1            8.333699
x2           29.171750
x3           26.629401
x4           -1.824458

We then get the ANOVA table:

anova_table = anova_lm(lm)
anova_table

            df         sum_sq        mean_sq           F        PR(>F)
x1         1.0   10394.554366   10394.554366   28.605241  6.110239e-07
x2         1.0  113541.846572  113541.846572  312.460911  8.531356e-32
x3         1.0   66267.787822   66267.787822  182.365304  7.899193e-24
x4         1.0     298.584632     298.584632    0.821688  3.669804e-01
Residual  95.0   34521.039456     363.379363         NaN           NaN

Summing everything in the sum_sq column except the residuals, and dividing by the total, gives you the same R-squared as the one from sklearn:

anova_table['sum_sq'][:-1].sum() / anova_table['sum_sq'].sum()
0.8465893941639528

Now the proportion of variance explained by, for example, 'x1' (for an individual term we seldom call this R-squared) is:

anova_table.loc['x1','sum_sq'] / anova_table['sum_sq'].sum()
0.046193130558342954
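If you want the same proportion for every term at once (plus the unexplained residual share), a small follow-up on the anova_table computed above is:

# Proportion of variance attributed to each term, plus the residual share;
# the non-residual entries sum to the overall R-squared from above
prop_explained = anova_table['sum_sq'] / anova_table['sum_sq'].sum()
print(prop_explained)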
