
Negative accuracy in linear regression

My linear regression model has a negative coefficient of determination R².

How can this happen? Any ideas would be helpful.

Here is my dataset:

year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0

The code of the LinearRegression model is as follows:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("data.csv", header=None)
data = data.drop(0, axis=0)  # drop the header row

X = data[0]
Y = data[1]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, shuffle=False)

lm = LinearRegression()
lm.fit(X_train.values.reshape(-1, 1), Y_train.values.reshape(-1, 1))

Y_pred = lm.predict(X_test.values.reshape(-1, 1))

accuracy = lm.score(Y_test.values.reshape(-1, 1), Y_pred)
print(accuracy)
Output:
-3592622948027972.5

Here is the formula of the R² score:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}

\hat{y_i} is the predicted value of the i-th observation y_i and \bar{y} is the mean of all observations.

Therefore, a negative R² means that if someone knew the mean of your y_test sample and always used it as a "prediction", this "prediction" would be more accurate than your model.
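As a quick illustration of that statement, here is a minimal sketch (using sklearn.metrics.r2_score on made-up numbers, not your data): always predicting the mean gives R² = 0, and anything systematically worse goes negative.

from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
print(r2_score(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean -> 0.0
print(r2_score(y_true, [9.0, 9.0, 9.0]))  # worse than the mean -> negative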

Moving on to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let us have a quick look at your data.

df.population.plot()

[Plot: population vs. year]

It looks like a logarithmic transformation could help: the population grows roughly geometrically, so its logarithm should be close to linear in the year.

import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()

[Plot: logarithm of the population vs. year]
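A quick way to see why the transformation works (a sketch, assuming df is the DataFrame produced by the loading script): the year-over-year growth ratio is nearly constant, which is exactly the case where log(population) is linear in the year.

# the ratio stays in a narrow band (roughly 1.01-1.04), i.e. near-constant
# relative growth, hence a near-linear logarithm
print((df.population / df.population.shift(1)).describe())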

Now let us perform a linear regression using OpenTURNS.

import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)

Output:

Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469

[Plot: linear regression of the logarithm of the population]

This is an almost exact fit.
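As a rough check on the original scale, the fitted line can be exponentiated by hand (a sketch based only on the coefficients printed above):

import numpy as np

# population(year) ≈ exp(a + b * year), from the best fitting line above
a, b = -38.35148311467912, 0.028172928802559845
print(np.exp(a + b * 2018))  # ~1.08e8, close to the observed 109,224,559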

EDIT

As suggested by @Prayson W. Daniel, here is the model fit after it is transformed back to the original scale.

# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)

# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))

# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph

Output:

R2 score in original scale = 0.9979032805107133

[Plot: model fit in the original scale]

Scikit-learn's LinearRegression score uses the R² score. A negative R² means that the model fits your data extremely badly. Since R² compares the fit of the model with that of the null hypothesis (a horizontal straight line, i.e. always predicting the mean of y), R² is negative when the model fits worse than that horizontal line.

R² = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))

So if SUM((y - ypred)**2) is greater than SUM((y - AVG(y))**2), then R² will be negative.
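For reference, model.score(X, y) is the same R² that sklearn.metrics.r2_score computes on the model's predictions; here is a minimal sketch on made-up data (the names are illustrative, not from the question):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = np.arange(10.0).reshape(-1, 1)
y = 3 * X.ravel() + 1
lm = LinearRegression().fit(X, y)
# score takes the feature matrix X first, then the true targets y
print(np.isclose(lm.score(X, y), r2_score(y, lm.predict(X))))  # True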

Reasons and ways to correct it

Problem 1: You are performing a random split of time-series data. A random split ignores the temporal dimension.
Solution: Preserve the time flow (see the code below).

Problem 2: The target values are very large.
Solution: Unless we use tree-based models, you have to do some target feature engineering to scale the data into a range the model can learn.

Here is a code example. Using the default parameters of LinearRegression and a log|exp transformation of the target values, my attempt yields an R² score of ~87%:


import pandas as pd
import numpy as np

# we need to transform/feature-engineer our target:
# np.log and np.exp (via TransformedTargetRegressor) make the values learnable

from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor

# your data is in df

# transform year to a reference year
df = df.assign(ref_year=lambda x: x.year - 1960)
df.population = df.population.astype(int)

split = int(df.shape[0] * .9)  # split at 90%/10%-ish, preserving time order

df = df[['ref_year', 'population']]

train_df = df.iloc[:split]
test_df = df.iloc[split:]

X_train = train_df[['ref_year']]
y_train = train_df.population

X_test = test_df[['ref_year']]
y_test = test_df.population

# regressor with log/exp target transformation
regressor = LinearRegression()
lr = TransformedTargetRegressor(
        regressor=regressor,
        func=np.log, inverse_func=np.exp)

lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))
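As a usage sketch (the future years are hypothetical, not part of the original answer), the fitted pipeline can extrapolate directly on the ref_year feature, and the exp inverse transform is applied automatically:

# ref_year 60-62 correspond to the calendar years 2020-2022
future = pd.DataFrame({'ref_year': [60, 61, 62]})
print(lr.predict(future))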

For those interested in making it better, here is a way to read that dataset:

import pandas as pd
import io

df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0 
1961,22671191.0 
1962,23221389.0 
1963,23798430.0 
1964,24397022.0 
1965,25013626.0 
1966,25641044.0 
1967,26280132.0 
1968,26944390.0 
1969,27652709.0 
1970,28415077.0 
1971,29248643.0 
1972,30140804.0 
1973,31036662.0 
1974,31861352.0 
1975,32566854.0 
1976,33128149.0 
1977,33577242.0 
1978,33993301.0 
1979,34487799.0 
1980,35141712.0 
1981,35984528.0 
1982,36995248.0 
1983,38142674.0 
1984,39374348.0 
1985,40652141.0 
1986,41965693.0 
1987,43329231.0 
1988,44757203.0 
1989,46272299.0 
1990,47887865.0 
1991,49609969.0 
1992,51423585.0 
1993,53295566.0 
1994,55180998.0
1995,57047908.0 
1996,58883530.0 
1997,60697443.0 
1998,62507724.0 
1999,64343013.0 
2000,66224804.0 
2001,68159423.0 
2002,70142091.0 
2003,72170584.0 
2004,74239505.0
2005,76346311.0
2006,78489206.0 
2007,80674348.0 
2008,82916235.0 
2009,85233913.0 
2010,87639964.0 
2011,90139927.0 
2012,92726971.0 
2013,95385785.0 
2014,98094253.0 
2015,100835458.0 
2016,103603501.0 
2017,106400024.0 
2018,109224559.0
'''))

Results: [figure omitted]
