I am performing a fairly straightforward multiple linear regression in Python using sklearn. See the code snippet below; full_results is a dataframe in which all variables are numeric.
This code returns a single coefficient of determination, which I believe denotes how much of the variation in y is explained by the combination of x1 to x4.
My question is whether the coefficient of determination can be split out among the 4 input variables, so I can see how much of the variation in y is attributed to each variable individually.
I can of course run a single-variable linear regression for each variable independently, but this doesn't feel like the right solution.
I have a memory of being in stats class many years ago and doing something similar in R.
from sklearn.linear_model import LinearRegression

x = full_results[['x1','x2','x3','x4']].values
y = full_results['y'].values

mlr = LinearRegression()
mlr.fit(x, y)
mlr.score(x, y)  # returns the coefficient of determination, R^2
The coefficient of determination is the proportion of the total variance explained. So another way of looking at it is to compute the proportion of variance explained by each term. For this we use an ANOVA to calculate the sum of squares attributable to each term.
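Concretely, with sequential (Type I) sums of squares, which is what anova_lm computes by default, the explained variance partitions across the terms:

R^2 = SS_model / SS_total
    = (SS_x1 + SS_x2 + SS_x3 + SS_x4) / (SS_x1 + SS_x2 + SS_x3 + SS_x4 + SS_residual)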
One thing to take note of is that this decomposition only works cleanly if your predictors are uncorrelated. If they are correlated, the order in which they are specified in the model makes a difference to the calculation, as the short sketch below illustrates.
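Here is a minimal sketch of that caveat (hypothetical two-predictor data; assumes statsmodels and numpy are available). Fitting the same model with the terms in the two possible orders gives different per-term (Type I) sums of squares once the predictors are correlated:

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.3, size=200)   # b is strongly correlated with a
y = a + b + rng.normal(size=200)
d = pd.DataFrame({'a': a, 'b': b, 'y': y})

# Sequential (Type I) sums of squares depend on the term order here:
print(anova_lm(ols('y ~ a + b', d).fit())['sum_sq'])
print(anova_lm(ols('y ~ b + a', d).fit())['sum_sq'])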
Using an example dataset:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
import pandas as pd

X, y = make_regression(n_samples=100, n_features=4,
                       n_informative=3, noise=20, random_state=99)
df = pd.DataFrame(X, columns=['x1','x2','x3','x4'])
df['y'] = y

mlr = LinearRegression()
mlr.fit(df[['x1','x2','x3','x4']], y)

mlr.coef_
# array([ 8.33369861, 29.1717497 , 26.6294007 , -1.82445836])

mlr.score(df[['x1','x2','x3','x4']], y)
# 0.8465893941639528
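Since, as noted above, the clean decomposition relies on the predictors being uncorrelated, it's worth checking that first. make_regression draws the features independently, so the sample correlations here are close to zero:

df[['x1','x2','x3','x4']].corr().round(2)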
It's easier to calculate this with statsmodels. Making the same linear fit there, you can see the coefficients come out pretty much identical:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
lm = ols('y ~ x1 + x2 + x3 + x4',df).fit()
lm.params
Intercept -0.740399
x1 8.333699
x2 29.171750
x3 26.629401
x4 -1.824458
From that fit we get the ANOVA table:
anova_table = anova_lm(lm)
anova_table
df sum_sq mean_sq F PR(>F)
x1 1.0 10394.554366 10394.554366 28.605241 6.110239e-07
x2 1.0 113541.846572 113541.846572 312.460911 8.531356e-32
x3 1.0 66267.787822 66267.787822 182.365304 7.899193e-24
x4 1.0 298.584632 298.584632 0.821688 3.669804e-01
Residual 95.0 34521.039456 363.379363 NaN NaN
Summing everything in the sum_sq column except the residual, and dividing by the total sum of squares, gives you the same r-squared as from sklearn:
anova_table['sum_sq'][:-1].sum() / anova_table['sum_sq'].sum()
0.8465893941639528
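Equivalently, since the shares sum to one, subtracting the residual's share from 1 gives the same number:

1 - anova_table.loc['Residual','sum_sq'] / anova_table['sum_sq'].sum()
# 0.8465893941639528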
Now the proportion of variance explained by, for example, 'x1' (for an individual term we don't usually call this an r-squared) is:
anova_table.loc['x1','sum_sq'] / anova_table['sum_sq'].sum()
0.046193130558342954
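And if you want all four shares at once (same anova_table as above), they add back up to the overall R^2:

shares = anova_table['sum_sq'] / anova_table['sum_sq'].sum()
shares[:-1]        # per-term proportion of variance explained
shares[:-1].sum()  # 0.8465893941639528, the R^2 from sklearn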