
How to increase true positives in your classification Machine Learning model?

I am new to Machine Learning. I have a dataset with highly imbalanced classes (dominated by the negative class) that contains more than 2K numeric features, and the target is binary {0, 1}. I have trained a logistic regression; although I get an accuracy of 89%, the confusion matrix shows that the model's true positives are very low. Below are my model's scores:

Accuracy Score: 0.8965989500114129

Precision Score: 0.3333333333333333

Recall Score: 0.029545454545454545

F1 Score: 0.05427974947807933

How can I increase my true positives? Should I be using a different classification model?

I have tried PCA and represented my data in 2 components; it increased the model's accuracy to about 90%, but the true positives decreased again.

There are several ways to do this:

  • You can change your model and test whether it performs better.
  • You can use a different prediction threshold: here I guess you predict 0 if the output of your regression is < 0.5; you could change the 0.5 to 0.25, for example. This would increase your true positive rate, but of course at the price of some more false positives.
  • You can duplicate every positive example in your training set so that your classifier sees the classes as balanced.
  • You can change the loss of the classifier to penalize false negatives more heavily (this is actually pretty close to duplicating your positive examples in the dataset).

I'm sure many other tricks could apply; this is just my favorite short-list.
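The threshold and loss-penalty tricks above can be sketched in scikit-learn like this; the toy dataset, the 0.25 threshold, and `class_weight='balanced'` are illustrative assumptions, not details from the question:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Imbalanced toy data: roughly 10% positives
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' re-weights the loss so false negatives cost more
clf = LogisticRegression(max_iter=1000, class_weight='balanced')
clf.fit(X_tr, y_tr)

# Predict 1 whenever P(y=1) exceeds 0.25 instead of the default 0.5
proba = clf.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.25).astype(int)
print("recall at 0.25 threshold:", recall_score(y_te, y_pred))
```

Lowering the threshold can only turn predicted negatives into predicted positives, so recall never decreases as the threshold drops; the cost shows up in precision instead.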

I'm assuming that your purpose is to obtain a model with good classification performance on some test set, regardless of the form of that model. In that case, if you have access to the computational resources, try gradient-boosted trees. That's an ensemble classifier that builds many decision trees sequentially, each one correcting the errors of the trees before it, and combines their outputs to make predictions. As far as I know, it can give good results with unbalanced class counts.

scikit-learn has the class sklearn.ensemble.GradientBoostingClassifier for this. I have not used that particular one, but I use the regression version often and it seems good. I'm pretty sure MATLAB has this as a toolbox too, if you have access.

2K features might be difficult for the scikit-learn algorithm - I don't know; I've never tried.
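A minimal sketch of the suggestion above; the toy dataset and the hyperparameter values are assumptions for illustration, not tuned recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Imbalanced toy data standing in for the real dataset
X, y = make_classification(n_samples=2000, n_features=50,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Sequentially built trees; depth and learning rate are typical defaults
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbt.fit(X_tr, y_tr)
print(classification_report(y_te, gbt.predict(X_te)))
```

Look at the per-class recall in the report rather than overall accuracy, since accuracy is misleading on imbalanced data.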

What is the size of your dataset? How many rows are we talking about here?

Your dataset is not balanced, so it's normal for a simple classification algorithm to predict the majority class most of the time and give you an accuracy of 90%. Can you collect more data that has more positive examples in it?

Or just try oversampling/undersampling and see if that helps.

You can also use a penalized version of the algorithm that imposes an extra cost whenever the wrong class is predicted. That may help.
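The oversampling idea can be sketched with sklearn.utils.resample; the array names and 10:1 ratio here are illustrative assumptions (the dedicated imbalanced-learn package offers more options):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.randn(1100, 5)
y = np.array([0] * 1000 + [1] * 100)   # 10:1 imbalance

X_pos, X_neg = X[y == 1], X[y == 0]

# Oversample the positives (with replacement) up to the negative count
X_pos_up = resample(X_pos, replace=True, n_samples=len(X_neg),
                    random_state=0)

X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.concatenate([np.zeros(len(X_neg)), np.ones(len(X_pos_up))])
print(X_bal.shape, np.bincount(y_bal.astype(int)))  # now balanced 1000/1000
```

Only resample the training split, never the test set, or your evaluation scores will be inflated.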

You can try many different solutions.

If you have quite a lot of data points (for instance 2K 1s and 20K 0s), you can simply drop the extra 0s and keep only 2K of them, then train on that. You can also train multiple models, each using a different set of 2K 0s and the same set of 2K 1s, and make decisions based on those multiple models.

You can also try adding weights at the output layer. For instance, if you have 10 times more 0s than 1s, try multiplying the prediction value for the 1s by 10.

You could probably also try increasing dropout, if your model is a neural network.

And so on.
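The "multiple models on different sets of 0s" idea might look like this: train several classifiers, each on all the 1s plus a different random sample of 0s, then average their predictions. All names, sizes, and the choice of logistic regression are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(2200, 10)
y = np.array([0] * 2000 + [1] * 200)   # 10:1 imbalance

pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

models = []
for seed in range(5):
    # Each model sees all positives plus a fresh balanced sample of negatives
    sub_neg = np.random.RandomState(seed).choice(neg_idx, size=len(pos_idx),
                                                 replace=False)
    idx = np.concatenate([pos_idx, sub_neg])
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    models.append(m)

# Average the predicted probabilities across the ensemble
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
y_pred = (proba >= 0.5).astype(int)
print("positives predicted:", y_pred.sum())
```

This is essentially a hand-rolled balanced bagging ensemble; each member sees balanced data, so none of them collapses to always predicting the majority class.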
