
Strange plot after linear regression using Numpy's least squares

I am doing linear regression with multiple variables. To get the thetas (coefficients) I used Numpy's least-squares tool, numpy.linalg.lstsq. My data has n = 143 features and m = 13000 training examples. I want to plot house prices against area and show the fitted line for this feature.

Data preparation code (Python):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

path = 'DB2.csv'
data = pd.read_csv(path, header=None, delimiter=";")
data.insert(0, 'Ones', 1)  # column of ones for the intercept term

cols = data.shape[1]
X = data.iloc[:,0:cols-1]  # every column except the last: the features
y = data.iloc[:,cols-1:cols]  # the last column: the price

Getting theta coefficients with numpy.linalg.lstsq:

thetas = np.linalg.lstsq(X, y, rcond=None)[0]  # rcond=None: machine-precision cutoff for small singular values

Prediction part:

allAreasData = X.iloc[:,120] #Used as argument to scatter all training data
areasTestValues = X.iloc[0:100,120] #Used as argument for plot function 
testingExamples = X.iloc[0:100,:] #Used to make predictions

predictions = testingExamples.dot(thetas)

Note: 120 in the above code is the index of the Area column in my dataset.

Visualization part:

fig, ax = plt.subplots(figsize=(18,10))
ax.scatter(allAreasData, y, label='Training Data', color='r')
ax.plot(areasTestValues, predictions, 'b', label='Prediction')
ax.legend(loc=2)
ax.set_xlabel('Area')
ax.set_ylabel('Price')
ax.set_title('Predicted Price vs. House Area')

Output plot: [image]

I expected to get a single regression line that fits the data, but instead I got this strange polyline (broken line). What am I doing wrong? Scatter works correctly, but plot does not. I pass 2 arguments to the plot function:

1) Testing area data (100 area data examples)
2) Predictions of price based on 100 training examples that include area data


Update: After sorting x I got this plot with a curve: [image]

I was expecting a straight line fitting all my data with the least squared error, but instead I got a curve. Aren't linear regression and the numpy.linalg.lstsq tool supposed to return a straight fitting line rather than a curve?
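
For reference, the sorting step mentioned in the update could look roughly like the sketch below. The arrays are made-up stand-ins for areasTestValues and predictions, not values from the real dataset. The point is that x and the predictions have to be reordered with the same index array, otherwise the pairs no longer match.

```python
import numpy as np

# Hypothetical stand-ins for areasTestValues and predictions
areas = np.array([120.0, 45.0, 80.0, 60.0])
preds = np.array([300.0, 110.0, 210.0, 150.0])

order = np.argsort(areas)      # indices that put the areas in ascending order
areas_sorted = areas[order]
preds_sorted = preds[order]    # reorder predictions with the same indices

# ax.plot(areas_sorted, preds_sorted, 'b') would then draw left to right
```

Sorting only removes the back-and-forth zigzag caused by plot connecting points in row order. It does not change the predicted values themselves.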

Your result is linear in a 143-dimensional space. ;) Since your X contains many more features than just the area, the prediction also depends (linearly) on those features.
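
A small sketch of this effect, using synthetic data rather than the asker's DB2.csv: the feature names (area, rooms), the coefficients and the noise level are all assumptions made up for illustration. With two features, the predictions plotted against area alone do not fall on one line. If the other feature is held at a fixed value (here its mean) and only area varies, they do.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
area = rng.uniform(30, 200, m)                  # synthetic area values
rooms = rng.integers(1, 6, m).astype(float)     # hypothetical second feature
X = np.column_stack([np.ones(m), area, rooms])  # ones column = intercept
y = 50 + 2.0 * area + 30.0 * rooms + rng.normal(0, 5, m)

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ theta

# Predictions against area alone: the slope between neighbouring points
# changes, because rooms differs from row to row.
order = np.argsort(area)
slopes = np.diff(pred[order]) / np.diff(area[order])

# Hold rooms at its mean and vary only area: a straight line with slope theta[1].
grid = np.linspace(area.min(), area.max(), 50)
X_grid = np.column_stack([np.ones(50), grid, np.full(50, rooms.mean())])
line = X_grid @ theta
```

Plotting grid against line on top of the scatter is one way to show the contribution of area while still using the full multi-feature model.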

If you redo your training with X = data.iloc[:,120] (only considering the area feature), you should get a straight line when you plot the results.
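
A minimal sketch of that single-feature fit, again on synthetic data with made-up numbers. One assumption goes beyond the answer's one-liner: numpy.linalg.lstsq needs a two-dimensional matrix, so the area values are stacked next to a column of ones here, which also keeps the intercept. With the asker's DataFrame that would probably correspond to selecting columns 0 and 120 rather than column 120 alone.

```python
import numpy as np

rng = np.random.default_rng(1)
area = rng.uniform(30, 200, 100)                   # synthetic area values
price = 40 + 2.5 * area + rng.normal(0, 10, 100)   # synthetic prices

# Design matrix with only two columns: intercept and area
X = np.column_stack([np.ones_like(area), area])
theta, *_ = np.linalg.lstsq(X, price, rcond=None)
pred = X @ theta

# Sort before plotting so the line is drawn left to right
order = np.argsort(area)
area_sorted, pred_sorted = area[order], pred[order]
# ax.plot(area_sorted, pred_sorted, 'b') now draws a single straight line
```

Every prediction here equals theta[0] + theta[1] * area, so all the points are collinear by construction.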
