Strange plot after linear regression using Numpy's least squares

I am doing linear regression with multiple variables. To get the thetas (coefficients) I used NumPy's least-squares tool, numpy.linalg.lstsq. My data has n = 143 features and m = 13000 training examples. I want to plot house prices against area and show the fitted line for this feature.

Data preparation code (Python):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  

path = 'DB2.csv'  
data = pd.read_csv(path, header=None, delimiter=";")
data.insert(0, 'Ones', 1)

cols = data.shape[1]
X = data.iloc[:,0:cols-1]  
y = data.iloc[:,cols-1:cols] 

Getting theta coefficients with numpy.linalg.lstsq:

thetas = np.linalg.lstsq(X, y)[0]

Prediction part:

allAreasData = X.iloc[:,120] #Used as argument to scatter all training data
areasTestValues = X.iloc[0:100,120] #Used as argument for plot function 
testingExamples = X.iloc[0:100,:] #Used to make predictions

predictions = testingExamples.dot(thetas)

Note: 120 in the above code is index of Area column in my dataset.

Visualization part:

fig, ax = plt.subplots(figsize=(18,10))  
ax.scatter(allAreasData, y, label='Training Data', color='r') 
ax.plot(areasTestValues, predictions, 'b', label='Prediction')  
ax.legend(loc=2)  
ax.set_xlabel('Area')  
ax.set_ylabel('Price')  
ax.set_title('Predicted Price vs. House Area')

Output plot: [image]

I expected a single regression line that fits the data, but instead I got this strange polyline (broken line). What am I doing wrong? The scatter works correctly, but the plot does not. I pass two arguments to the plot function:

1) Testing area data (100 area data examples)
2) Predictions of price based on 100 training examples that include area data


Update: After sorting x I got this plot with a curve: [image]

I was expecting a straight line fitting all my data with least-squares error, but instead I got a curve. Aren't linear regression and numpy.linalg.lstsq supposed to return a straight fitted line rather than a curve?

Your result is linear in a 143-dimensional space. ;) Since your X contains many more features than just the area, the prediction also depends (linearly) on those features. Plotted against the area alone, the predicted points therefore do not fall on a single straight line.
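To illustrate the point, here is a minimal sketch on synthetic data (the features rooms and age, their coefficients, and the noise level are made up for illustration and are not from the question's dataset). It fits a multi-feature model with numpy.linalg.lstsq and then measures how far the predictions stray from the best straight line through the (area, prediction) points:

```python
import numpy as np

# Synthetic data: all names and numbers below are illustrative assumptions.
rng = np.random.default_rng(0)
m = 200
area = rng.uniform(50, 200, size=m)
rooms = rng.integers(1, 6, size=m).astype(float)
age = rng.uniform(0, 40, size=m)

# Design matrix with an intercept column, as in the question.
X = np.column_stack([np.ones(m), area, rooms, age])
y = 1000 * area + 5000 * rooms - 300 * age + rng.normal(0, 2000, size=m)

# Fit on all features; predictions are linear in the full feature space.
thetas, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ thetas

# Best straight line through the (area, prediction) points only.
A = np.column_stack([np.ones(m), area])
line_coef, *_ = np.linalg.lstsq(A, predictions, rcond=None)

# Largest vertical distance between a prediction and that line.
deviation = np.max(np.abs(predictions - A @ line_coef))
print(deviation)
```

The deviation is far from zero because the rooms and age terms move each prediction up or down independently of the area, which is what shows up as a jagged or curved trace in a 2-D plot.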

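If you redo your training with X = data.iloc[:, [0, 120]] (only the column of ones and the area feature; lstsq needs a 2-D X, so a single column selected as data.iloc[:,120] will not work as-is), you should get a straight line when you plot the results.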
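A minimal sketch of that single-feature fit, again on synthetic data (the area and price values here are made up; with the real dataset you would take the columns from data instead):

```python
import numpy as np

# Synthetic stand-in for the area column and the price target (assumed values).
rng = np.random.default_rng(1)
m = 100
area = rng.uniform(50, 200, size=m)
price = 1000 * area + rng.normal(0, 20000, size=m)

# Only two columns: the intercept (ones) and the area.
X_area = np.column_stack([np.ones(m), area])
theta, *_ = np.linalg.lstsq(X_area, price, rcond=None)

# Sort by area so a line plot connects the points from left to right.
order = np.argsort(area)
area_sorted = area[order]
predictions = X_area[order] @ theta

# Every segment between consecutive points has the same slope, theta[1].
slopes = np.diff(predictions) / np.diff(area_sorted)
is_straight = np.allclose(slopes, theta[1])
print(is_straight)
```

Passing area_sorted and predictions to ax.plot should then draw one straight line through the scatter, since the model now has a single slope and a single intercept.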
