
Newbie: How do I evaluate a model to increase its accuracy in classification?

my data

[image: enter image description here]

How do I increase the accuracy of my model? When run, some of my models produce results like the ones below:

from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Fitting the classifier to the Training set
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
# Output: Accuracy: 0.6780893042575286

Random Forest Classifier: Accuracy: 0.6780893042575286

There are several ways to achieve this:

  1. Look at the data. Is it in the best shape for the algorithm? Have you dealt with NaN values, checked the covariance between features, and so on? Are the numeric features normalized, and are the categorical ones encoded well? This question is too far-reaching for a forum.

  2. Look at the problem and the different algorithms suitable for it. Maybe:

  • Logistic Regression
  • SVM
  • XGBoost
  • ...

  3. Try hyperparameter tuning with RandomizedSearchCV or GridSearchCV.

This is quite high-level.
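To make the data-preparation point a bit more concrete, here is a minimal sketch of a scikit-learn pipeline that imputes NaN values, scales the numeric columns and one-hot encodes a categorical column before fitting a decision tree. The DataFrame, its column names (`age`, `income`, `city`) and the target are all made up for illustration, since the asker's actual data isn't shown; treat it as a starting point rather than a recipe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two numeric columns (one with NaNs) and one categorical column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
df.loc[rng.choice(n, 10, replace=False), "age"] = np.nan  # inject missing values
y = (df["income"] > 50_000).astype(int)  # made-up binary target, illustration only

numeric_cols = ["age", "income"]
categorical_cols = ["city"]

# Numeric columns: fill NaNs with the median, then standardize.
# Categorical columns: one-hot encode, ignoring categories unseen during fit.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", DecisionTreeClassifier(criterion="entropy", random_state=0)),
])

# Cross-validation refits the preprocessing on each training fold,
# so no statistics leak from the validation fold.
scores = cross_val_score(model, df, y, cv=5, scoring="accuracy")
print("Mean CV accuracy:", scores.mean())
```

Keeping the preprocessing inside the pipeline means the same steps are applied at prediction time. Tree-based models usually don't need scaling, but Logistic Regression, SVM and KNN generally do.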

In terms of model selection, you can use a function like the one below to find a good model that suits the problem.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn import model_selection
from sklearn.metrics import classification_report


def multi_model(X_train, y_train, X_test, y_test):
    """ Function to determine the best model architecture """

    dfs = []
    models = [
        ('LogReg', LogisticRegression()),
        ('RF', RandomForestClassifier()),
        ('KNN', KNeighborsClassifier()),
        ('SVM', SVC()),
        ('GNB', GaussianNB()),
        ('XGB', XGBClassifier(eval_metric="error"))
    ]

    # 'roc_auc' and the two target names below assume a binary target;
    # adjust both for a multi-class problem.
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']
    target_names = ['App_Status_1', 'App_Status_2']

    # Same folds for every model so the scores are comparable
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=90210)

    for name, model in models:
        # Cross-validated scores on the training set
        cv_results = model_selection.cross_validate(model, X_train, y_train, cv=kfold, scoring=scoring)

        # Fit on the full training set and report on the held-out test set
        clf = model.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        print(name)
        print(classification_report(y_test, y_pred, target_names=target_names))

        # One row per fold, tagged with the model name
        this_df = pd.DataFrame(cv_results)
        this_df['model'] = name
        dfs.append(this_df)

    final = pd.concat(dfs, ignore_index=True)
    return final

After model selection, you can do hyperparameter tuning, which can further improve the model's performance.
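As a rough illustration of what tuning can look like, the sketch below runs scikit-learn's `RandomizedSearchCV` over a few Random Forest parameters. The data comes from `make_classification` as a stand-in for your own `X_train`/`y_train`, and the parameter ranges are arbitrary assumptions, not recommended values.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative search space (assumed ranges, not tuned recommendations).
param_distributions = {
    "n_estimators": randint(50, 150),
    "max_depth": [None, 5, 10, 20],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,          # number of random parameter combinations to try
    cv=5,               # 5-fold cross-validation for each combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
# The test set is only used once, after the search has finished.
print("Test accuracy:", search.score(X_test, y_test))
```

`GridSearchCV` has nearly the same interface but takes a `param_grid` of explicit value lists and tries every combination, which gets expensive quickly as the grid grows.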

If you want to improve the model further, you can apply techniques like data augmentation and also revisit the cleaning phase of your data.

If it still doesn't improve after all that, you could try collecting more data or refocusing the problem statement.
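One way to get a hint about whether more data is likely to help is a learning curve: train on increasing fractions of the data and watch the validation score. The sketch below uses scikit-learn's `learning_curve` on synthetic data from `make_classification`, which is an assumption standing in for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)

# Evaluate the model on 20%, 40%, ..., 100% of the available training data.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the 5 folds for each training-set size.
for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(size):4d} samples: train={tr:.3f}, validation={va:.3f}")
```

If the validation score is still climbing at the largest size, more data may help. If it has flattened while the training score stays much higher, the model is more likely overfitting, and regularization or better features are probably a better use of time than more rows.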
