Getting 100% accuracy on my DecisionTree model

Question

Here is my code, and it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicates of data. Could someone inspect my code?

from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


data = pd.read_csv('housing.csv')

prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)

prices.shape
(20640,)

features.shape
(20640, 8)


X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

y_train.shape
(16512,)

X_train.shape
(16512, 8)


predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score

Answer 1

EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the code below to ensure no bugs are left.

Issues:

You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
You are removing NaNs after the train/test split, and separately for the features and the target, which will mess up the count of samples. Call data.dropna() before the split.
You are using model.score(X_test, y_test) incorrectly: you call model.score(y_test, predictions), and because y_test was assigned from X_test.dropna(), this is in effect model.score(X_test, predictions). Either call model.score(X_test, y_test), or pass (y_test, predictions) to a metric function instead, such as accuracy_score for a classifier or r2_score for a regressor.
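A small illustration of the third issue, separate from the corrected script: model.score(X, y) predicts on X and compares the result with y, so passing the model's own predictions as y always gives a perfect score. The sketch below uses synthetic random data and a classifier purely as an assumption for illustration; it is not the housing dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): random features and
# random binary labels, so there is nothing real for the tree to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)

# Wrong: the second argument is the model's own output, so score()
# compares model.predict(X_test) with itself.
self_score = model.score(X_test, predictions)

# Right: compare the predictions with the held-out labels.
true_score = model.score(X_test, y_test)

print(self_score)  # 1.0
print(true_score)
```

The first number is 1.0 no matter how the data is split, which matches the symptom in the question; only the second number says anything about the model.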

import pandas as pd  # needed for pd.read_csv
from sklearn.tree import DecisionTreeRegressor  # <---- FIRST ISSUE
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score


data = pd.read_csv('housing.csv')

# Drop rows with missing values before splitting, so features and target stay aligned
data = data.dropna()  # <--- SECOND ISSUE

prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

model = DecisionTreeRegressor()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
# Metric functions take (y_true, y_pred). accuracy_score is a classification metric,
# so r2_score is used here; it gives the same value as model.score(X_test, y_test).
score = r2_score(y_test, predictions)  # <----- THIRD ISSUE
score
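To see why dropna() belongs before the split, here is a minimal sketch on a hypothetical toy frame (the column names rooms and price are made up for illustration). Dropping NaNs from the features and the target separately removes rows from one but not the other, while dropping on the whole frame keeps every row paired.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are made up for illustration.
toy = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100.0, 150.0, 200.0, 180.0],
})

features = toy[["rooms"]]
prices = toy["price"]

# Dropping NaNs separately: only the feature row with a missing value goes,
# so features and target end up with different lengths.
separate_X = features.dropna()
separate_y = prices.dropna()
print(len(separate_X), len(separate_y))  # 3 4

# Dropping NaNs on the whole frame first keeps every row paired.
clean = toy.dropna()
clean_X = clean[["rooms"]]
clean_y = clean["price"]
print(len(clean_X), len(clean_y))  # 3 3
```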

Answer 1: 1 vote, accepted, 2020-11-23 18:09:37
