
Python linear regression: plt.plot() not showing a straight line. Instead it connects every point on the scatter plot

I am relatively new to Python. I am trying to do a multivariate linear regression and to plot scatter plots with the line of best fit, using one feature at a time.

This is my code:

Train=df.loc[:650] 
valid=df.loc[651:]

x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]

x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()

regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)

y_pred=regr.predict(x_test)

plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)

plt.show()

And this is the graph that I'm getting:

[Image: the resulting plot]

I have searched a lot, but to no avail. I want to understand why this is not showing a line of best fit, and why it is instead connecting all the points on the scatter plot.

Thank you!

Linear regression means that you are predicting the value linearly, which will always give you a best-fit line. Nothing else is possible. In your code:

Train=df.loc[:650] 
valid=df.loc[651:]

x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]

x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()

regr=linear_model.LinearRegression()
regr.fit(x_train,y_train)

y_pred=regr.predict(x_test)

plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)

plt.show()

Use the right variables to plot the line, i.e.:

plt.plot(x_test,y_pred)

Plot the graph between the values that you use for testing and the predictions that you get from them, i.e.:

y_pred=regr.predict(x_test)

Also, your model must be trained on the same features that you plot; otherwise you will get a straight line, but the results will be unexpected.

This is multivariate data, so you need to plot the pairwise lines: http://www.sthda.com/english/articles/32-r-graphics-essentials/130-plot-multivariate-continuous-data/#:~:text=wiki%2F3d%2Dgraphics-,Create%20a%20scatter%20plot%20matrix,pairwise%20comparison%20of%20multivariate%20data.&text=Create%20a%20simple%20scatter%20plot%20matrix

Or change the model so that it is fitted on a single feature, which changes the model completely:

Train=df.loc[:650] 
valid=df.loc[651:]

x_train=Train[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_train=Train['sales'].dropna()
y_train=y_train.loc[7:]

x_test=valid[['lag_7','rolling_mean', 'expanding_mean']].dropna()
y_test=valid['sales'].dropna()

regr=linear_model.LinearRegression()
# scikit-learn expects a 2-D feature array, hence the double brackets
regr.fit(x_train[['lag_7']],y_train)

y_pred=regr.predict(x_test[['lag_7']])

plt.scatter(x_test['lag_7'], y_pred,color='black')
plt.plot(x_test['lag_7'],y_pred, color='blue', linewidth=3)

plt.show()

Assuming your graphical library is matplotlib, imported with import matplotlib.pyplot as plt, the problem is that you passed the same data to both plt.scatter and plt.plot. The former draws the scatter plot, while the latter passes a line through all points in the order given: it first draws a straight line between (x_test['lag_7'][0], y_pred[0]) and (x_test['lag_7'][1], y_pred[1]), then one between (x_test['lag_7'][1], y_pred[1]) and (x_test['lag_7'][2], y_pred[2]), etc.
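To make the order-of-points behaviour concrete, here is a minimal sketch on made-up data. The array X stands in for x_test, and coef is an invented set of coefficients playing the role of a fitted three-feature model; neither comes from the question. The left panel reproduces the zigzag. The right panel draws a separate 1D least-squares line through the same points using np.polyfit, which is one possible way to get a straight line, not the multivariate model itself.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to use an interactive one
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in for x_test: 50 rows, three features (made-up data).
X = rng.uniform(0.0, 10.0, size=(50, 3))
# Invented coefficients playing the role of a fitted multivariate model.
coef = np.array([4.0, 0.5, -2.0])
y_pred = X @ coef   # each prediction depends on all three features
x = X[:, 0]         # but we plot against the first feature only

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Left: the same data passed to scatter() and plot(), as in the question.
ax_left.scatter(x, y_pred, color="black")
(zigzag,) = ax_left.plot(x, y_pred, color="blue", linewidth=1)
ax_left.set_title("plot() joins points in row order")

# Right: a separate 1D least-squares line through (x, y_pred).
slope, intercept = np.polyfit(x, y_pred, 1)
x_line = np.array([x.min(), x.max()])  # two end points are enough for a line
ax_right.scatter(x, y_pred, color="black")
ax_right.plot(x_line, slope * x_line + intercept, color="blue", linewidth=3)
ax_right.set_title("1D fit drawn from two end points")

fig.savefig("plot_order_demo.png")
```

Because the predictions depend on all three features, the points do not lie on one line when plotted against a single feature, so no ordering of the points would turn the left panel into a straight line.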

Concerning the more general question of how to do multivariate regression and plot the results, I have two remarks:

  • Finding the line of best fit one feature at a time amounts to performing 1D regression on that feature: it is an altogether different model from the multivariate linear regression you want to perform.

  • I don't think it makes much sense to split your data into train and test samples, because linear regression is a very simple model with little risk of overfitting. In the following, I consider the whole data set df.
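The first remark can be checked numerically. The sketch below uses invented data with two correlated features (x1, x2 and the coefficients 4 and -2 are assumptions for illustration only) and compares the slope of a 1D fit on x1 with the x1 coefficient of the two-feature fit. It uses plain NumPy least squares rather than any particular regression library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Made-up data: x2 is strongly correlated with x1.
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)
y = 4.0 * x1 - 2.0 * x2 + 0.1 * rng.normal(size=n)

# 1D regression of y on x1 alone (polyfit returns slope first, then intercept).
slope_1d, intercept_1d = np.polyfit(x1, y, 1)

# Multivariate regression on both features, with an intercept column.
A = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print("1D slope for x1:           ", slope_1d)
print("multivariate coef. for x1: ", coef[0])
```

In this construction the 1D slope absorbs part of the effect of x2, so it lands near 2 while the multivariate coefficient stays near 4: the two fits answer different questions.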

I like to use OpenTURNS because it has built-in linear regression viewing facilities. The downside is that to use it, we need to convert your pandas tables (DataFrame or Series) to OpenTURNS objects of the class Sample.

import pandas as pd
import numpy as np
import openturns as ot
from openturns.viewer import View

# convert pandas DataFrames to numpy arrays and then to OpenTURNS Samples
X = ot.Sample(np.array(df[['lag_7','rolling_mean', 'expanding_mean']]))
X.setDescription(['lag_7','rolling_mean', 'expanding_mean']) # keep labels
Y = ot.Sample(np.array(df[['sales']]))
Y.setDescription(['sales'])

You did not provide your data, so I need to generate some:

func = ot.SymbolicFunction(['x1', 'x2', 'x3'], ['4*x1 + 0.05*x2 - 2*x3'])
inputs_distribution = ot.ComposedDistribution([ot.Uniform(0, 3.0e6)]*3)
residuals_distribution = ot.Normal(0.0, 2.0e6)
ot.RandomGenerator.SetSeed(0)
X = inputs_distribution.getSample(30)
X.setDescription(['lag_7','rolling_mean', 'expanding_mean'])
Y = func(X) + residuals_distribution.getSample(30)
Y.setDescription(['sales'])

Now, let us find the best-fitting line one feature at a time (1D linear regression):

linear_regression_1 = ot.LinearModelAlgorithm(X[:, 0], Y)
linear_regression_1.run()
linear_regression_1_result = linear_regression_1.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 0], Y, linear_regression_1_result))

[Figure: linear regression on lag_7]

linear_regression_2 = ot.LinearModelAlgorithm(X[:, 1], Y)
linear_regression_2.run()
linear_regression_2_result = linear_regression_2.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 1], Y, linear_regression_2_result))

[Figure: linear regression on rolling_mean]

linear_regression_3 = ot.LinearModelAlgorithm(X[:, 2], Y)
linear_regression_3.run()
linear_regression_3_result = linear_regression_3.getResult()
View(ot.VisualTest_DrawLinearModel(X[:, 2], Y, linear_regression_3_result))

[Figure: linear regression on expanding_mean]

As you can see, in this example, none of the one-feature linear regressions is able to predict the output very accurately.

Now let us do multivariate linear regression. To plot the result, it is best to view the actual vs. predicted values.

full_linear_regression = ot.LinearModelAlgorithm(X, Y)
full_linear_regression.run()
full_linear_regression_result = full_linear_regression.getResult()
full_linear_regression_analysis = ot.LinearModelAnalysis(full_linear_regression_result)
View(full_linear_regression_analysis.drawModelVsFitted())

[Figure: multivariate linear regression]

As you can see, in this example, the fit is much better with multivariate linear regression than with 1D regressions one feature at a time.
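For readers who prefer to stay with matplotlib, an actual vs. fitted view can be sketched as follows. This is not the OpenTURNS code above: it uses NumPy least squares as a stand-in for the regression, and the data are invented, with the same coefficient pattern as the generated sample but scaled down (inputs uniform on [0, 3], noise standard deviation 2).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to use an interactive one
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 30

# Made-up features standing in for lag_7, rolling_mean, expanding_mean.
X = rng.uniform(0.0, 3.0, size=(n, 3))
y = 4.0 * X[:, 0] + 0.05 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(0.0, 2.0, size=n)

# Multivariate least squares with an intercept column.
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_fit = A @ coef

# Coefficient of determination of the in-sample fit.
ss_res = np.sum((y - y_fit) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(y, y_fit, color="black")
# Points on the diagonal are fitted perfectly.
lims = [min(y.min(), y_fit.min()), max(y.max(), y_fit.max())]
ax.plot(lims, lims, color="blue", linewidth=2)
ax.set_xlabel("actual sales")
ax.set_ylabel("fitted sales")
ax.set_title(f"R^2 = {r2:.2f}")
fig.savefig("actual_vs_fitted.png")
```

The straight line here is the diagonal y = x, which works for any number of features because both axes are in units of the output.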
