
High recall, low precision with EasyEnsembleClassifier

I have a dataset with 450,000 data points, 12 features, and a binary label (0 or 1). I am using the Python imblearn library because my dataset is imbalanced (ratio 1:50, class 1 is the minority). I am using EasyEnsembleClassifier as the classifier. My problem is that I get high recall but very low precision, as you can see from the classification report below (90% recall, 8% precision, 14% F1 score).
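For intuition, this pattern is exactly what a 1:50 imbalance produces when the classifier mislabels even a modest fraction of the majority class. A back-of-the-envelope illustration with made-up counts (not taken from the post):

# Hypothetical counts at a 1:50 class ratio (illustration only)
P, N = 1_000, 50_000            # minority positives, majority negatives
recall, fpr = 0.90, 0.20        # assumed operating point of the classifier
TP = recall * P                 # 900 correctly flagged positives
FP = fpr * N                    # 10,000 negatives flagged as positive
precision = TP / (TP + FP)      # 900 / 10,900 ~= 0.083 -> ~8%
print(f"precision = {precision:.3f}")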

Here is my code:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from dask_ml.preprocessing import RobustScaler
from imblearn.ensemble import EasyEnsembleClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score, precision_score, confusion_matrix
from sklearn import metrics

df = pd.read_csv(...)  # path elided in the original post
X = df[['features...']]
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training set only, then apply the same scaling to the test set
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = EasyEnsembleClassifier(n_estimators=50, n_jobs=-1, sampling_strategy=1.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

[Image in the original post: classification report showing ~90% recall, ~8% precision, ~14% F1 for class 1]

I tried different scalers, namely MinMaxScaler and StandardScaler. I tried changing the train/test split ratio and different parameters of EasyEnsembleClassifier. I also tried BalancedRandomForestClassifier from the same library, but the results are the same. Changing the number of estimators in the classifier parameters also doesn't change the result.

What is the reason for these results? What can I do to improve precision without damaging recall? It looks like I am doing something wrong in my code or I am missing an important concept.
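One thing worth checking before swapping models is the decision threshold: EasyEnsembleClassifier exposes predict_proba, so the default 0.5 cutoff can be moved along the precision-recall curve. A minimal sketch, reusing clf, X_test, and y_test from the code above; maximizing F1 is just one example criterion:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Scores for the minority class (column 1 of predict_proba)
y_scores = clf.predict_proba(X_test)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_test, y_scores)

# Example criterion: pick the threshold that maximizes F1.
# The last precision/recall pair has no threshold, hence [:-1].
f1 = 2 * precisions[:-1] * recalls[:-1] / (precisions[:-1] + recalls[:-1] + 1e-12)
best = f1.argmax()
print(f"threshold={thresholds[best]:.3f}  "
      f"precision={precisions[best]:.3f}  recall={recalls[best]:.3f}")

y_pred_tuned = (y_scores >= thresholds[best]).astype(int)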

Edit: I still couldn't figure out the true reason for my problem, but since no one answered my question, here are some ideas about what could be causing this weird model, in case someone else encounters a similar problem:

  • Most probably my dataset is poorly labeled. It is possible that the model cannot distinguish the classes because they are very alike. I will try to generate some synthetic data to train my model again.
  • I did not test this, but some features may be harming the model. I need to inspect the data to find out if there are correlations between features and remove some of them (a quick check is sketched after this list), but I highly doubt this is the problem, because boosting classifiers should handle it automatically by weighting each feature.
  • Also, 12 features may not be enough in my case. I may need more. Although it is not easy to generate more features for my dataset, I will think about it.
  • Finally, maybe undersampling is not suited to my dataset. I will give oversampling techniques or SMOTE a shot if I feel desperate enough (see the pipeline sketch after this list).
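For the feature-correlation point above, a quick pairwise check is cheap; this sketch assumes X is still the pandas DataFrame from the question:

# Pairwise Pearson correlations between the 12 features
corr = X.corr()
print(corr.round(2))

# Flag strongly correlated pairs (0.9 is an arbitrary example cutoff)
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)
print(corr.where(high).stack())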
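And for the last bullet, oversampling with SMOTE is easiest to try through imblearn's pipeline, which resamples only during fit so the test set stays untouched. A minimal sketch; the classifier choice and sampling_strategy=0.5 are arbitrary examples, not recommendations from the post:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),  # oversample minority to a 1:2 ratio
    ("model", RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
])
pipe.fit(X_train, y_train)   # SMOTE is applied to the training data only
y_pred = pipe.predict(X_test)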

You could try other ensemble methods for class-imbalance learning. SMOTEBoost is one such method that combines boosting with data-level sampling, essentially injecting the SMOTE technique at each boosting iteration.
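Note that imblearn itself does not ship SMOTEBoost; the closest built-in is RUSBoostClassifier, which applies random undersampling (rather than SMOTE) at each boosting round. A quick sketch using the same training data as above:

from imblearn.ensemble import RUSBoostClassifier

# Boosting with random undersampling injected at every iteration
rusboost = RUSBoostClassifier(n_estimators=50, sampling_strategy=1.0, random_state=42)
rusboost.fit(X_train, y_train)
y_pred_rb = rusboost.predict(X_test)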

This article could be of interest to you.
