Getting 100% Accuracy on my DecisionTree Model
Here is my code, and it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicates of data. Could someone inspect my code?
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape
(20640,)
features.shape
(20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape
(16512,)
X_train.shape
(16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the below code to ensure no bugs are left.
Issues -
1. You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
2. You are removing NaNs after doing the train/test split, which will mess up the count of samples. Do the data.dropna() before the split.
3. You are using model.score(X_test, y_test) incorrectly by passing it (X_test, predictions). Because y_test was assigned X_test.dropna(), the call scores the model's predictions against themselves, which is why the result is always 100%. Please fix the call to model.score(X_test, y_test), or pass (y_test, predictions) to a metric function from sklearn.metrics instead.
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('housing.csv')
data = data.dropna() #<--- SECOND ISSUE
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
# accuracy_score is a classification metric; for a regressor, score(X, y) returns R^2
score = model.score(X_test, y_test) #<----- THIRD ISSUE
score
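To see the NaN-handling and scoring issues in isolation, here is a minimal sketch on a small synthetic table. The columns (rooms, income, price), the noise level, and the injected missing values are all made up for illustration and are not taken from housing.csv. It shows that calling dropna() separately on features and targets can leave them with different lengths, and that scoring a model against its own predictions is trivially perfect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing.csv (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "rooms": rng.uniform(1, 10, n),
    "income": rng.uniform(1, 15, n),
})
data["price"] = (20_000 * data["rooms"] + 10_000 * data["income"]
                 + rng.normal(0, 30_000, n))
data.loc[::25, "rooms"] = np.nan  # inject missing values into one feature only

# Dropping NaNs piece by piece: features lose rows, targets do not.
X_raw, y_raw = data[["rooms", "income"]], data["price"]
rows_after_separate_dropna = (len(X_raw.dropna()), len(y_raw.dropna()))

# Dropping NaNs before the split keeps each feature row paired with its target.
clean = data.dropna()
X_train, X_test, y_train, y_test = train_test_split(
    clean[["rooms", "income"]], clean["price"], test_size=0.2, random_state=42
)

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)

# Scoring against the model's own predictions compares them with themselves.
self_score = model.score(X_test, predictions)
# Scoring against the held-out targets gives the actual R^2.
real_score = model.score(X_test, y_test)

print(rows_after_separate_dropna, self_score, real_score)
```

The first tuple should show unequal row counts, which is the mismatch that dropping NaNs before the split avoids. The self-comparison score is perfect by construction, whereas the score against y_test reflects how the tree does on unseen data.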