SVM and Random Forest with recall = 0

I am trying to predict which of two values appears in the column 'exit'. I have clean data (about 20 columns and 4k rows with typical customer information such as 'sex' and 'age'). In the training dataset, about 20% of customers are labeled '1'. I built two models, an SVM and a random forest, but both predict '0' for almost every row of the test dataset. The recall of both models is 0. I have attached the code, where I suspect I made a silly mistake. Any ideas why recall is so low while accuracy is around 80%?

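import pandas as pd
import sklearn
from scipy import stats
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def ml_model():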
    print('sklearn: %s' % sklearn.__version__)
    df = pd.read_csv('clean_data.csv')
    df.head()
    feat = df.drop(columns=['target'], axis=1)
    label = df["target"]
    x_train, x_test, y_train, y_test = train_test_split(feat, label, test_size=0.3)
    sc_x = StandardScaler()
    x_train = sc_x.fit_transform(x_train)

    # SVC method
    support_vector_classifier = SVC(probability=True)
    # Grid search
    rand_list = {"C": stats.uniform(0.1, 10),
                 "gamma": stats.uniform(0.1, 1)}
    auc = make_scorer(roc_auc_score)
    rand_search_svc = RandomizedSearchCV(support_vector_classifier, param_distributions=rand_list, n_iter=100, n_jobs=4, cv=3, random_state=42,
                                     scoring=auc)
    rand_search_svc.fit(x_train, y_train)
    support_vector_classifier = rand_search_svc.best_estimator_
    cross_val_svc = cross_val_score(estimator=support_vector_classifier, X=x_train, y=y_train, cv=10, n_jobs=-1)
    print("Cross Validation Accuracy for SVM: ", round(cross_val_svc.mean() * 100, 2), "%")
    predicted_y = support_vector_classifier.predict(x_test)
    tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
    precision_score = tp / (tp + fp)
    recall_score = tp / (tp + fn)
    print("Recall score SVC: ", recall_score)


    # Random forests
    random_forest_classifier = RandomForestClassifier()
    # Grid search
    param_dist = {"max_depth": [3, None],
                  "max_features": sp_randint(1, 11),
                  "min_samples_split": sp_randint(2, 11),
                  "bootstrap": [True, False],
                  "criterion": ["gini", "entropy"]}
    rand_search_rf = RandomizedSearchCV(random_forest_classifier, param_distributions=param_dist,
                                       n_iter=100, cv=5, iid=False)
    rand_search_rf.fit(x_train, y_train)
    random_forest_classifier = rand_search_rf.best_estimator_
    cross_val_rfc = cross_val_score(estimator=random_forest_classifier, X=x_train, y=y_train, cv=10, n_jobs=-1)
    print("Cross Validation Accuracy for RF: ", round(cross_val_rfc.mean() * 100, 2), "%")
    predicted_y = random_forest_classifier.predict(x_test)
    tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
    precision_score = tp / (tp + fp)
    recall_score = tp / (tp + fn)
    print("Recall score RF: ", recall_score)

    new_data = pd.read_csv('new_data.csv')
    new_data = cleaning_data_to_predict(new_data)
    if round(cross_val_svc.mean() * 100, 2) > round(cross_val_rfc.mean() * 100, 2):
        predictions = support_vector_classifier.predict(new_data)
        predictions_proba = support_vector_classifier.predict_proba(new_data)
    else:
        predictions = random_forest_classifier.predict(new_data)
        predictions_proba = random_forest_classifier.predict_proba(new_data)

    # Open the output file once and close it automatically when done
    with open("output.txt", "w") as f:
        for i, (pred, proba) in enumerate(zip(predictions.tolist(), predictions_proba.tolist())):
            print("id: ", i, "probability: ", proba[1], "exit: ", pred, file=f)

Unless I have missed it, you forgot to scale your test set. The models were trained on standardized features, so the test set must be scaled too. Only transform it with the scaler already fitted on the training data; do not fit the scaler again. See below.

x_test = sc_x.transform(x_test)
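To make the fit-versus-transform distinction concrete, here is a minimal sketch on synthetic data. The arrays are made-up stand-ins for the customer features, not the asker's data; only `StandardScaler`, `fit_transform` and `transform` come from the code in the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features (hypothetical data).
rng = np.random.default_rng(0)
x_train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
x_test = rng.normal(loc=50.0, scale=10.0, size=(80, 3))

sc_x = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data only.
x_train_scaled = sc_x.fit_transform(x_train)
# transform reuses those training statistics; the scaler is not refitted.
x_test_scaled = sc_x.transform(x_test)

# Both sets now live on roughly the same scale (column means near 0).
print(x_train_scaled.mean(axis=0).round(2))
print(x_test_scaled.mean(axis=0).round(2))
```

If the test set is left unscaled, the model sees values on a completely different scale from the one it was trained on, which can easily push every prediction into one class. Wrapping the scaler and the classifier in a `sklearn.pipeline.Pipeline` is one way to make sure the step cannot be forgotten.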

I agree with @e_kapti. Also check the formulas for recall and accuracy; you might consider using the F1 score instead (https://en.wikipedia.org/wiki/F1_score).

Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, FP, TN and FN are the numbers of true positives, false positives, true negatives and false negatives, respectively.
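As a rough illustration of why 80% accuracy and a recall of 0 can coexist, the sketch below plugs confusion-matrix counts into these formulas. The function name `classification_metrics` and the counts are hypothetical; they simply mimic a 1000-row test set with 20% positives on which the model always predicts '0'.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against 0/0 when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical test set: 1000 rows, 20% positives, every prediction is '0'.
always_zero = classification_metrics(tn=800, fp=0, fn=200, tp=0)
print(always_zero)
```

With these counts the accuracy simply mirrors the share of the majority class, while recall and F1 are both 0, so on imbalanced data accuracy alone says little about how well the '1' class is detected.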
