
python scikit linear-regression weird results

I'm new to Python.

I'm trying to plot, using matplotlib, the results of a linear regression.

I've tried with some basic data and it worked, but when I try with the actual data, the regression line is completely wrong. I think I'm doing something wrong with the fit() or predict() functions.

This is the code:

import matplotlib.pyplot as plt
from sklearn import linear_model
import scipy
import numpy as np
regr=linear_model.LinearRegression()
A=[[69977, 4412], [118672, 4093], [127393, 12324], [226158, 15453], [247883, 8924], [228057, 6568], [350119, 4040], [197808, 6793], [205989, 8471], [10666, 632], [38746, 1853], [12779, 611], [38570, 1091], [38570, 1091], [95686, 8752], [118025, 17620], [79164, 13335], [83051, 1846], [4177, 93], [29515, 1973], [75671, 5070], [10077, 184], [78975, 4374], [187730, 17133], [61558, 2521], [34705, 1725], [206514, 10548], [13563, 1734], [134931, 7117], [72527, 6551], [16014, 310], [20619, 403], [21977, 437], [20204, 258], [20406, 224], [20551, 375], [38251, 723], [20416, 374], [21125, 429], [20405, 235], [20042, 431], [20016, 366], [19702, 200], [20335, 420], [21200, 494], [22667, 487], [20393, 405], [20732, 414], [20602, 393], [111705, 7623], [112159, 5982], [6750, 497], [59624, 418], [111468, 10209], [40057, 1484], [435, 0], [498848, 17053], [26585, 1390], [75170, 3883], [139146, 3540], [84931, 7214], [19144, 3125], [31144, 2861], [66573, 818], [114253, 4155], [15421, 2094], [307497, 5110], [484904, 10273], [373476, 36365], [128152, 10920], [517285, 106315], [453483, 10054], [270763, 17542], [9068, 362], [61992, 1608], [35791, 1747], [131215, 6227], [4314, 191], [16316, 2650], [72791, 2077], [47008, 4656], [10853, 1346], [66708, 4855], [214736, 11334], [46493, 4236], [23042, 737], [335941, 11177], [65167, 2433], [94913, 7523], [454738, 12335]]
# My data are selected from a MySQL DB and stored in a NumPy array like the one above.
A = np.array(A)  # convert the literal list so that the column slicing below (A[:, 1]) works



regr.fit(A,A[:,1]) 
plt.scatter(A[:,0],A[:,1], color='black')
plt.plot(A[:,1],regr.predict(A), color='blue',linewidth=3)
plt.show()

What I want is a regression line using the data from the first and second columns of A. This is the result:

[image]

I know that the presence of outliers can heavily impact the output, but I tried other tools for regression and the regression line was a lot closer to the area where the points are, so I'm sure I'm missing something.

Thank you.

EDIT 1: As suggested, I tried again, changing only the plot() param. Instead of A[:,1] I used A[:,0], and this is the result:

[image]

A simple example at scikit-learn.org/stable/modules/linear_model.html looks like mine. I don't need prediction, so I didn't split my data into training and test sets. Maybe I'm misunderstanding the meaning of "X, y", but again, looking at the example in the link, it looks like mine.

EDIT 2: It finally worked.

X=A[:,0]
X=X[:,np.newaxis]
regr=linear_model.LinearRegression()
regr.fit(X,A[:,1])
plt.plot(X,regr.predict(X))

The X param just needs to be a 2-D array. The example in EDIT 1 really misled me :(.
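To illustrate the 2-D requirement, here is a minimal sketch of three equivalent ways to turn a single column into the (n_samples, 1) shape that fit() expects. The small array below is made-up data in the same two-column layout as A, not the data from the question. Current scikit-learn versions reject a 1-D X with an error asking for a 2-D array.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up sample in the same layout as A: column 0 is x, column 1 is y = 2x + 1.
A = np.array([[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, 9.0]])

x = A[:, 0]                    # shape (4,): 1-D, not accepted as X
X_newaxis = x[:, np.newaxis]   # shape (4, 1), as in EDIT 2
X_reshape = x.reshape(-1, 1)   # shape (4, 1), same values
X_slice = A[:, [0]]            # shape (4, 1), list index keeps the column axis

model = LinearRegression().fit(X_reshape, A[:, 1])
slope = model.coef_[0]
intercept = model.intercept_
```

All three variants produce the same array, so which one to use is mostly a matter of taste.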

You seem to be including the target values A[:, 1] in your training data. The fitting command is of the form regr.fit(X, y).
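A minimal sketch of why that matters, using made-up numbers rather than the data from the question: when the target column is also one of the features, ordinary least squares can reproduce y exactly by putting a weight of 1 on that column and 0 on the other, so predict() simply hands the targets back.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: column 0 is the predictor, column 1 is the target.
A = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 7.0], [4.0, 8.0], [5.0, 9.0]])
y = A[:, 1]

# Leaky fit: the target column is also passed in as a feature.
leaky = LinearRegression().fit(A, y)
leaky_pred = leaky.predict(A)          # equals y up to rounding

# Intended fit: only the first column is used as X.
X = A[:, [0]]
proper = LinearRegression().fit(X, y)
proper_pred = proper.predict(X)        # points on a straight line in x
```

If that is what happened in the question, the blue line would just be A[:, 1] plotted against itself (or, after EDIT 1, the original points joined in row order), which is consistent with the odd plots described there.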

You also seem to have a problem with this line:

plt.plot(A[:,1],regr.predict(A), color='blue',linewidth=3)

I think you should replace A[:, 1] with A[:, 0] if you want to plot your prediction against the predictor values.

You may find it easier to split your data into X and y at the beginning, which may make things clearer.
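As a sketch of that suggestion, the block below splits the first five rows of A from the question into X and y up front, fits on X only, and predicts at the smallest and largest X to get the two end points of the regression line. The names x_line and y_line are illustrative, not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# First five rows of A from the question (the full array works the same way).
A = np.array([[69977, 4412], [118672, 4093], [127393, 12324],
              [226158, 15453], [247883, 8924]], dtype=float)

# Split once, at the beginning: X is 2-D (n_samples, 1), y is 1-D.
X = A[:, [0]]
y = A[:, 1]

regr = LinearRegression()
regr.fit(X, y)

# A straight line only needs two points: predict at the extremes of X.
x_line = np.array([[X.min()], [X.max()]])
y_line = regr.predict(x_line)
```

With matplotlib, plt.scatter(X[:, 0], y, color='black') followed by plt.plot(x_line[:, 0], y_line, color='blue', linewidth=3) should then draw the points and a single straight line through them.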
