Strange plot after linear regression using NumPy's least squares
I am doing linear regression with multiple variables. To get the thetas (coefficients), I used NumPy's least-squares tool, numpy.linalg.lstsq. My data has n = 143 features and m = 13000 training examples. I want to plot house prices against area and show the fitted line for this feature.
Data preparation code (Python):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
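path = 'DB2.csv'
# The file has no header row and uses ';' as the field separator
data = pd.read_csv(path, header=None, delimiter=";")
# Prepend a column of ones so the model gets an intercept term
data.insert(0, 'Ones', 1)
cols = data.shape[1]
# All columns except the last are features; the last column is the price
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]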
Getting theta coefficients with numpy.linalg.lstsq:
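# rcond=None selects NumPy's current default cutoff and avoids the FutureWarning from older 1.x releases
thetas = np.linalg.lstsq(X, y, rcond=None)[0]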
Prediction part:
allAreasData = X.iloc[:,120] #Used as argument to scatter all training data
areasTestValues = X.iloc[0:100,120] #Used as argument for plot function
testingExamples = X.iloc[0:100,:] #Used to make predictions
predictions = testingExamples.dot(thetas)
Note: 120 in the above code is the index of the Area column in my dataset.
Visualization part:
fig, ax = plt.subplots(figsize=(18,10))
ax.scatter(allAreasData, y, label='Training Data', color='r')
ax.plot(areasTestValues, predictions, 'b', label='Prediction')
ax.legend(loc=2)
ax.set_xlabel('Area')
ax.set_ylabel('Price')
ax.set_title('Predicted Price vs. House Area')
I expected a single regression line fitted to the data, but instead I got this strange polyline (broken line). What am I doing wrong? The scatter plot works correctly, but the line plot does not. I pass two arguments to the plot function:
1) Testing area data (100 area values)
2) Price predictions for the 100 training examples that contain those area values
Update: After sorting x, I got a plot with a curve.
I was expecting a straight line that fits all my data with the least squared error, but got a curve instead. Aren't linear regression and numpy.linalg.lstsq supposed to return a straight fitted line rather than a curve?
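For reference, the sorting step mentioned in the update can be sketched roughly as follows. This is only an illustration on synthetic data: the names area, pred and order are made up here and are not from the original code. The point is that x and the predictions have to be reordered with the same index array, otherwise the pairs get scrambled.

```python
import numpy as np

rng = np.random.default_rng(0)
area = rng.uniform(50, 200, size=10)   # synthetic, unsorted x values
pred = 1000.0 * area + 5000.0          # stand-in predictions (assumed, not real data)

order = np.argsort(area)               # indices that sort x in ascending order
area_sorted = area[order]
pred_sorted = pred[order]              # reorder y with the same indices as x
```

With the variables from the question, the equivalent would presumably be to sort areasTestValues and predictions together before calling ax.plot, since plot connects points in the order they are given.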
Your result is linear in a 143-dimensional space. ;) Since your X contains many more features than just the area, the prediction also depends (linearly) on those features.
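A small sketch of this effect, using synthetic data with just two features instead of 143 (the rooms feature, the coefficients and all variable names are assumptions made up for illustration). The predictions are exactly linear in (area, rooms), but when plotted against area alone they scatter around a line. As an optional extra that is not part of the answer above, holding the other feature at its mean and varying only the area does give a straight line.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
area = rng.uniform(50, 200, size=m)
rooms = rng.integers(1, 6, size=m).astype(float)   # hypothetical second feature
price = 3000.0 * area + 20000.0 * rooms + 10000.0 + rng.normal(0, 5000, size=m)

# Fit on all features: intercept, area, rooms
X = np.column_stack([np.ones(m), area, rooms])
theta, *_ = np.linalg.lstsq(X, price, rcond=None)
pred = X @ theta

# Best straight line through the points (area, pred); a large deviation
# means the predictions do not lie on a single line in the area axis
A = np.column_stack([np.ones(m), area])
line_coef, *_ = np.linalg.lstsq(A, pred, rcond=None)
deviation = np.max(np.abs(pred - A @ line_coef))

# Vary area only, with rooms fixed at its mean: this is a straight line
grid = np.linspace(area.min(), area.max(), 50)
X_grid = np.column_stack([np.ones_like(grid), grid, np.full_like(grid, rooms.mean())])
pred_grid = X_grid @ theta
slopes = np.diff(pred_grid) / np.diff(grid)   # constant, equal to theta[1]
```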
If you redo your training with X = data.iloc[:,120] (only considering the area feature), you should get a straight line when you plot the results.
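A minimal sketch of the single-feature fit, again on synthetic data with made-up names. One caveat: numpy.linalg.lstsq expects a 2-D design matrix, so the sketch stacks a column of ones with the area column. With the DataFrame from the question, something like data.iloc[:, [0, 120]] (the Ones column plus Area) would presumably play that role, but that is an assumption about the dataset layout.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
area = rng.uniform(50, 200, size=m)
price = 3000.0 * area + 10000.0 + rng.normal(0, 20000, size=m)   # synthetic prices

# 2-D design matrix of shape (m, 2): intercept column and area column
X_area = np.column_stack([np.ones(m), area])
theta, *_ = np.linalg.lstsq(X_area, price, rcond=None)

# Sort by area so a line plot connects the points left to right
order = np.argsort(area)
area_sorted = area[order]
pred_sorted = X_area[order] @ theta   # every point lies on theta[0] + theta[1] * area
```

Plotting area_sorted against pred_sorted then gives one straight line with intercept theta[0] and slope theta[1].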