I have built a linear regression model and would like to calculate the r2 score from its output. However, the result is really unexpected: as you can see below, the Pearson correlation between y and y_hat is positive, which I assumed meant the r2 score should be at least non-negative. Yet the result I got from sklearn is negative. How come? Thanks in advance!
import numpy as np
from sklearn.metrics import r2_score
from scipy.stats import pearsonr
y = np.array([ 5.2       ,  1.144     ,  3.3       ,  5.59741373,  1.438     ,
                7.562     ,  2.7       ,  0.22706035,  2.204     ,  2.396     ,
                4.314     , 12.51420331, 10.8       , 10.638     ,  5.101     ,
                3.861     ,  3.2       ,  3.8       ,  7.072     , -0.4597798 ,
               -0.9       ,  0.3       , -3.54      , -0.4       , -3.        ,
                0.7       ,  1.3       ,  1.5       ,  6.        ,  2.8       ,
                2.        ,  3.122     ])
y_hat = np.array([ 1.25131326, 2.64864629, 1.56201996, 4.26699994, 2.21499358,
0.59113701, 2.40848854, 0.14954989, 0.45800824, 2.82399621,
2.48736001, 2.78476975, 1.36378354, 3.4889863 , 2.4226333 ,
2.63939523, 4.15008518, 2.61525276, 2.29859288, -1.4358969 ,
-3.67752652, -3.73173215, -2.67027158, 0.35012302, 3.91349371,
5.11971861, 5.96586311, 3.36520449, 0.5204047 , 1.584193 ,
-0.05781178, 1.75957967])
print(pearsonr(y, y_hat)[0])  # Pearson correlation coefficient; this gives around 0.299
print(r2_score(y, y_hat))     # This gives -0.18478241562914666
I think I know what is going on here. I naively assumed that a positive correlation would imply a positive R squared, but this is not the case. By comparing the mean squared error of y_hat vs y with that of y_avg vs y (where y_avg means always predicting the mean of y), I realized that y_hat is indeed a worse estimator than always just predicting the average. Since r2_score is 1 minus the ratio of those two errors, it comes out negative.
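The comparison described above can be sketched as follows. This is a minimal check using only NumPy on the data from the question. The names mse_model, mse_baseline and r2 are chosen here for illustration, and R squared is computed by hand as 1 - SS_res / SS_tot rather than through sklearn.

```python
import numpy as np

# Data copied from the question.
y = np.array([5.2, 1.144, 3.3, 5.59741373, 1.438,
              7.562, 2.7, 0.22706035, 2.204, 2.396,
              4.314, 12.51420331, 10.8, 10.638, 5.101,
              3.861, 3.2, 3.8, 7.072, -0.4597798,
              -0.9, 0.3, -3.54, -0.4, -3.0,
              0.7, 1.3, 1.5, 6.0, 2.8,
              2.0, 3.122])
y_hat = np.array([1.25131326, 2.64864629, 1.56201996, 4.26699994, 2.21499358,
                  0.59113701, 2.40848854, 0.14954989, 0.45800824, 2.82399621,
                  2.48736001, 2.78476975, 1.36378354, 3.4889863, 2.4226333,
                  2.63939523, 4.15008518, 2.61525276, 2.29859288, -1.4358969,
                  -3.67752652, -3.73173215, -2.67027158, 0.35012302, 3.91349371,
                  5.11971861, 5.96586311, 3.36520449, 0.5204047, 1.584193,
                  -0.05781178, 1.75957967])

# Mean squared error of the model's predictions.
mse_model = np.mean((y - y_hat) ** 2)

# Mean squared error of the baseline that always predicts the mean of y.
mse_baseline = np.mean((y - y.mean()) ** 2)

# R squared as 1 - SS_res / SS_tot (hand-written, not the sklearn call).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MSE of y_hat vs y:", mse_model)
print("MSE of y_avg vs y:", mse_baseline)
print("R squared:", r2)
```

If mse_model is larger than mse_baseline, the ratio exceeds 1 and R squared drops below zero, which matches the negative value reported in the question.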
http://www.fairlynerdy.com/what-is-r-squared/
Take a look at the graph in the link above: even if two series move in the same direction, the offset introduced by the intercept can make the performance measured by MSE really bad.
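The intercept effect can be illustrated with a small made-up example (the numbers below are hypothetical and are not taken from the question or from the linked page). The predictions follow y perfectly apart from a constant offset of 10, so the correlation is 1 while R squared is strongly negative.

```python
import numpy as np

# Hypothetical toy data: predictions track y exactly but are shifted by 10.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = y + 10.0

# Pearson correlation ignores a constant shift, so it stays at 1.
corr = np.corrcoef(y, y_hat)[0, 1]

# R squared compares squared errors against the mean-only baseline.
ss_res = np.sum((y - y_hat) ** 2)     # 5 * 10**2 = 500
ss_tot = np.sum((y - y.mean()) ** 2)  # 4 + 1 + 0 + 1 + 4 = 10
r2 = 1.0 - ss_res / ss_tot            # 1 - 50 = -49

print("correlation:", corr)
print("R squared:", r2)
```

Correlation only measures whether the two series move together, while R squared also penalizes the distance between them, so a badly offset prediction can score worse than the constant mean.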