
How to predict a binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.

I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10 to 20 records. The data is labeled with its outcome.

So far I'm thinking of a logistic regression model with kernel approximation, based on the cheat sheet here.

I am unsure where to start when implementing this in either R or Python.

Thanks!

Choosing an algorithm and tuning its parameters is a difficult task in any data mining project, because both must be customized for your data and problem. Try different algorithms such as SVM, Random Forest, Logistic Regression, and KNN, evaluate each one with cross-validation, and then compare the results. You can use GridSearchCV in scikit-learn to try different parameter values and optimize them for each algorithm. Also try this project, which tests a range of parameters with a genetic algorithm.
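As a rough sketch of that workflow, something like the following could be a starting point. The data here is synthetic (`make_classification` standing in for the real records), and the candidate list and parameter grid are only illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic data standing in for the real, already-numeric features.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Scale-sensitive models are wrapped in a pipeline with a scaler.
candidates = {
    "svm": make_pipeline(StandardScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")

# Exhaustive search over a small, illustrative grid for one candidate.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

On the full 500,000 records, a kernel SVM and KNN may be slow, so it can help to run the comparison on a subsample first.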

Features

If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
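A minimal sketch of how that might look with mixed columns is below. The column names (`age`, `income`, `region`) and values are made up for illustration, and combining the encoder with `ColumnTransformer` and `StandardScaler` is one possible setup, not the only one.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two continuous columns and one categorical column.
df = pd.DataFrame({
    "age": [23.0, 45.0, 31.0, 52.0],
    "income": [40.0, 85.0, 60.0, 72.0],
    "region": ["north", "south", "north", "east"],
})

preprocess = ColumnTransformer([
    # Standardize the continuous columns.
    ("num", StandardScaler(), ["age", "income"]),
    # One-hot encode the categorical column; categories unseen during
    # fitting are encoded as all zeros instead of raising an error.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

X_encoded = preprocess.fit_transform(df)
print(X_encoded.shape)  # 2 scaled columns + 3 one-hot columns
```

The fitted `preprocess` object can be placed as the first step of a `Pipeline`, so the same encoding is applied during cross-validation and prediction.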

Model choice

The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.

You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
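For example, a projection to two dimensions could be sketched like this. PCA is just one choice of dimensionality reduction (my assumption, other methods exist), and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data standing in for the encoded feature matrix.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Standardize first so that no single feature dominates the components.
projector = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = projector.fit_transform(X)

# Fraction of the total variance kept by the two components.
explained = projector.named_steps["pca"].explained_variance_ratio_
print(X_2d.shape, explained.sum())
```

The columns `X_2d[:, 0]` and `X_2d[:, 1]` can then go into a scatter plot colored by `y`. If the two components keep only a small share of the variance, the plot may hide structure that exists in the full feature space.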

With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
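If you want a quick baseline before setting up Keras or TensorFlow, scikit-learn's own `MLPClassifier` can serve as a stand-in. Note that this swaps in a different library from the ones recommended above, and the layer sizes and synthetic data are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data standing in for the real records.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Neural networks are sensitive to feature scale, so standardize first.
# early_stopping holds out part of the training data and stops training
# when the validation score stops improving.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(32, 16),
        early_stopping=True,
        max_iter=300,
        random_state=0,
    ),
)
net.fit(X_train, y_train)
mlp_accuracy = net.score(X_test, y_test)
print(round(mlp_accuracy, 3))
```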

You should also know that there are ensemble methods.
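To give an idea of what that can look like, here is a small sketch that combines three classifiers by averaging their predicted probabilities. The choice of base models and the synthetic data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic data standing in for the real records.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Soft voting averages the class probabilities of the base models.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
test_accuracy = ensemble.score(X_test, y_test)
print(round(test_accuracy, 3))
```

Random forests and gradient boosting are themselves ensembles of decision trees, so either one alone is also a reasonable first try.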

A nice cheat sheet on what to use is in the sklearn tutorial you already found:

[Image: scikit-learn algorithm cheat sheet] (source: scikit-learn.org)

Just try it and compare the different results. Without more information it is not possible to give you better advice.
