Building a predictive model with Python. Projections are always 0

I am looking at some real estate data that I found online. I set up a model in Python; all code is shown below. All the data is from the boroughs of NYC and includes zip code, lotsize, commercial, residential, and a few other metrics. I'm trying to predict a 'target' variable for potentially developing a commercial real-estate lot, based on various factors.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data
train = pd.read_csv('C:\\Users\\Excel\\Desktop\\train.csv')
test = pd.read_csv('C:\\Users\\Excel\\Desktop\\test.csv')

df = pd.concat([train,test],axis=0) #Combined both Train and Test Data set
df.shape
pd.set_option('display.max_columns', None)


# fill in NANs.
df = df.fillna(0)

print('Data frame:', df)

# convert to numbers
df = df.select_dtypes(include=[np.number])


# Get all the columns from the dataframe.
columns = df.columns.tolist()
# Filter the columns to remove ones we don't want to use in the training
columns = [c for c in columns if c not in ['target']]


# Store the variable we'll be predicting on.
target = 'target'
train['target'] = 0
# Generate the training set.  Set random_state to be able to replicate results.
train = df.sample(frac=0.8, random_state=1)
# Select anything not in the training set and put it in the testing set.
test = df.loc[~df.index.isin(train.index)]
# Print the shapes of both sets.
print('Training set shape:', train.shape)
print('Testing set shape:', test.shape)
# Initialize the model class.
lin_model = LinearRegression()
# Fit the model to the training data.
lin_model.fit(train[columns], train[target])


# Generate our predictions for the test set.
lin_predictions = lin_model.predict(test[columns])
print('Predictions:', lin_predictions)
# Compute error between our test predictions and the actual values.
lin_mse = mean_squared_error(lin_predictions, test[target])
print('Computed error:', lin_mse)

This line is throwing an error:

lin_model.fit(train[columns], train[target])

Here is the error message:

KeyError: 'target'

Basically, the 'target' field doesn't appear in train[target].
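One likely cause of the KeyError can be reproduced with small made-up frames standing in for train.csv and test.csv. This is a sketch that assumes the CSV files contain no 'target' column; the column names lotsize and zipcode are placeholders. In the code above, df is built from train and test before train['target'] = 0 runs, and train is then overwritten by df.sample(...), so the assignment never reaches the frame passed to fit.

```python
import pandas as pd

# Hypothetical stand-ins for train.csv / test.csv (assumed to have no 'target' column).
train = pd.DataFrame({"lotsize": [1000, 2500, 4000], "zipcode": [10001, 10002, 10003]})
test = pd.DataFrame({"lotsize": [1800, 3200], "zipcode": [10004, 10005]})

# df is built here, before any 'target' column exists.
# ignore_index=True is a choice made in this sketch to keep row labels unique.
df = pd.concat([train, test], axis=0, ignore_index=True)

# This only modifies the original train frame, not df.
train["target"] = 0

# train is now replaced by a sample of df, which never had 'target'.
train = df.sample(frac=0.8, random_state=1)

has_target = "target" in train.columns
print("target in train:", has_target)

try:
    train["target"]
    raised_key_error = False
except KeyError:
    raised_key_error = True
print("KeyError raised:", raised_key_error)
```

If this matches the real data, the target values need to exist in df (loaded from the file, or assigned to df) before the train/test split is made.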

Even when I add the field in, the projections are ALWAYS 0! I must be missing something simple, but I'm not sure what.

I am following the example from here, but using a completely different data set.

https://microsoft.github.io/sql-ml-tutorials/python/rentalprediction/step/2.html

I can get the 'feature importance' of the factors using this snippet of code.

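# Imports assumed here; the original snippet omitted them.
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from yellowbrick.model_selection import FeatureImportances

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# X (features) and y (labels) are defined elsewhere and not shown in the question.
viz = FeatureImportances(GradientBoostingClassifier(), ax=ax)
viz.fit(X, y)
viz.poof()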

I'd like to add a comment but I can't yet. Why are you using linear regression to predict what I assume is a binary variable? Use logistic regression instead. Also, what is this line: columns = [c for c in columns if c not in ['target']]? Where did ['target'] come from? Another thing: train['target'] = 0 sets the entire column to 0; if you want to reassign only some column values, you should use df.loc instead. This is why all your predicted values are zero: the target is your dependent variable, and all of its values are set to 0.
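As a rough illustration of the logistic regression suggestion, here is a minimal sketch on synthetic data. It assumes the target really is binary, which the question does not confirm. The column names, the rule that generates the label, and the scaling step are all made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data; column names are placeholders.
df = pd.DataFrame({
    "lotsize": rng.uniform(500, 5000, size=200),
    "commercial": rng.integers(0, 2, size=200),
})
# A made-up binary label that actually depends on a feature.
df["target"] = (df["lotsize"] > 2500).astype(int)

features = [c for c in df.columns if c != "target"]
train = df.sample(frac=0.8, random_state=1)
test = df.drop(train.index)

# Scale the features, then fit a logistic regression classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(train[features], train["target"])

predictions = clf.predict(test[features])
accuracy = clf.score(test[features], test["target"])
print("Predicted classes:", sorted(set(predictions)))
print("Accuracy:", accuracy)
```

The key difference from the question's code is that the target column holds real labels that vary across rows before the model is fit.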

If all the samples in the training set have output/target = 0, as you set in the code:

train['target'] = 0

Then the algorithm will learn that, regardless of which features the model has, the prediction should always be 0.

Review why you need to set it to 0. This line seems unnecessary. Try removing it and running the model again.
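The effect described above can be checked on synthetic data. This is a small sketch with arbitrary random features, not the question's data set: when every training target is 0, LinearRegression fits zero coefficients and a zero intercept, so every prediction is 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

X = rng.normal(size=(50, 3))  # arbitrary synthetic features
y = np.zeros(50)              # every target value is 0, as with train['target'] = 0

model = LinearRegression().fit(X, y)

# Predict on new, unseen feature rows.
predictions = model.predict(rng.normal(size=(10, 3)))

print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Predictions:", predictions)
```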
