Random Forest Classifier accuracy doesn't get higher than 50%

I am very new to machine learning and I am trying to classify this UCI Heart Disease Dataset using sklearn's random forest classifier. My approach is very basic, and I wanted to ask how I could improve my accuracy with the algorithm (some tips, links, etc.). My accuracy tops out at about 50% every time. Here's my code:

import pandas as pd
import numpy as np
import random
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

df = pd.read_excel('/Users/Mady/Documents/ClevelandData.xlsx')
df.replace('?', -99999, inplace=True)

labels = df.iloc[:,-1]
labels = labels.values

df.drop(df.columns[len(df.columns)-1], axis=1, inplace=True)
riskFactors = df.values

random.seed(123)
random.shuffle(labels)
random.seed(123)
random.shuffle(riskFactors)

labels_train = labels[:(int(len(labels) * 0.8))]
labels_test = labels[(int(len(labels) * 0.8)):]

riskFactors_train = riskFactors[:(int(len(riskFactors) * 0.8))]
riskFactors_test = riskFactors[(int(len(riskFactors) * 0.8)):]

model = RandomForestClassifier(n_estimators = 1000)
model.fit(riskFactors_train,labels_train)
predicted_labels = model.predict(riskFactors_test)
acc = accuracy_score(labels_test,predicted_labels)
print(acc)

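I solved this by removing the manual shuffling, which is where the error was. The likely cause is that random.shuffle swaps elements with x[i], x[j] = x[j], x[i], and on a 2-D NumPy array such as riskFactors those elements are row views rather than copies, so rows get overwritten and duplicated and the features no longer line up with the labels. As suggested by Yulin Zhang, I used the train_test_split provided by sklearn instead, which shuffles and splits the features and labels together.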
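For reference, here is a minimal sketch of the train_test_split approach. It is not the exact code from the fix: it assumes a synthetic stand-in generated with make_classification (303 rows and 13 features, roughly the shape of the Cleveland data) so that it runs without the Excel file, and the names X, y, and the choice of 200 trees are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the Cleveland data (303 rows, 13 features).
# With the real file, X and y would come from the DataFrame instead.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=123
)

# train_test_split shuffles features and labels with the same permutation,
# so every row of X stays paired with its label in y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# Fit the forest on the training rows and score it on the held-out rows.
model = RandomForestClassifier(n_estimators=200, random_state=123)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(acc)
```

With the real data, X would be df.iloc[:, :-1].values and y would be df.iloc[:, -1].values, taken after the '?' placeholders have been handled. The stratify=y argument is optional; it keeps the class proportions the same in the training and test sets.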
