
How to predict a binary outcome with categorical and continuous features using scikit-learn?

Original post: 2016-07-29 14:44:28 | tags: python, r, machine-learning

I need advice on choosing a model and machine learning algorithm for a classification problem.

I'm trying to predict a binary outcome for a subject. My data set has 500,000 records and 20 features, a mix of continuous and categorical. Each subject has 10 to 20 records. The data is labeled with its outcome.

So far I'm thinking of a logistic regression model and kernel approximation, based on the cheat-sheet here.

I am unsure where to start when implementing this in either R or Python.

Thanks!

2 answers

Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because both must be customized for your data and problem. Try different algorithms such as SVM, Random Forest, Logistic Regression, KNN and so on, run cross-validation for each of them, and then compare the results. You can use GridSearchCV in scikit-learn to try different parameter values and optimize the parameters for each algorithm. Also try this project, which tests a range of parameters with a genetic algorithm.
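As a rough sketch of that workflow, the snippet below cross-validates a few candidate models and then runs a small grid search on one of them. The data is synthetic (`make_classification`), the `groups` array is a hypothetical subject id, and the parameter grid is only an illustration. Because the question mentions 10 to 20 records per subject, the sketch assumes you want all records of a subject in the same fold and uses `GroupKFold` for that; plain k-fold would also run. SVM is left out only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data set (assumption: numeric features only).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=0)

# Hypothetical subject ids: 10 rows per subject.
groups = np.arange(len(y)) // 10

# Keep every record of a subject inside a single fold.
cv = GroupKFold(n_splits=5)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Cross-validate each candidate with the same folds and the same metric.
scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, groups=groups, cv=cv,
                                  scoring="roc_auc")
    scores[name] = fold_scores.mean()
    print(f"{name}: mean ROC AUC = {scores[name]:.3f}")

# Grid search over an illustrative parameter grid for one of the models.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=cv, scoring="roc_auc")
search.fit(X, y, groups=groups)
print("best parameters:", search.best_params_)
print(f"best mean ROC AUC: {search.best_score_:.3f}")
```

The same loop works for any estimator with `fit` and `predict`, so adding an SVM or another model is just one more dictionary entry.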

Features

If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
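A minimal sketch of how that could look with mixed feature types, assuming the data sits in a pandas DataFrame. The column names (`age`, `income`, `region`, `plan`) and the toy data are made up for illustration. The continuous columns are scaled, the categorical ones are one-hot encoded, and both feed a logistic regression.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical column names.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.normal(40, 10, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "region": rng.choice(["north", "south", "east"], size=n),
    "plan": rng.choice(["basic", "premium"], size=n),
})
# Toy binary outcome that depends on age plus noise.
y = (df["age"] + rng.normal(0, 5, size=n) > 40).astype(int)

numeric_cols = ["age", "income"]
categorical_cols = ["region", "plan"]

# Scale continuous columns, one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    # handle_unknown="ignore" encodes unseen categories as all zeros
    # instead of raising an error at prediction time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(df, y)

# Encoded width: the numeric columns plus one column per category level.
encoded = clf.named_steps["preprocess"].transform(df)
print("encoded shape:", encoded.shape)
print("training accuracy:", clf.score(df, y))
```

Putting the encoder inside the pipeline means it is refit on each training fold during cross-validation, so no information leaks from the held-out fold.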

Model choice

The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.

You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
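One way to do that reduction is PCA, sketched below on synthetic data. PCA is my choice here, not something prescribed above; other techniques would serve the same purpose. The features are standardized first so that no single feature dominates the components.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real (already numeric) feature matrix.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Standardize, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# How much of the total variance the 2-D view retains.
print("explained variance ratio:", pca.explained_variance_ratio_)

# Class centroids in the projected space give a first hint of separability.
for label in np.unique(y):
    print(f"class {label} centroid:", X_2d[y == label].mean(axis=0))
```

The two columns of `X_2d` can then be drawn as a scatter plot colored by `y`, for example with matplotlib. If the explained variance is low, the 2-D picture may hide structure that exists in the full feature space.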

With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
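To get a first feel for a neural network without leaving scikit-learn, here is a small sketch using `MLPClassifier`. Note that this is a stand-in and not Keras or TensorFlow; the layer sizes and the synthetic data are arbitrary assumptions. A Keras model would follow the same outline: scale the inputs, fit on a training split, evaluate on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data set.
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Neural networks are sensitive to feature scale, so scale inside the pipeline.
net = make_pipeline(
    StandardScaler(),
    # Two hidden layers; the sizes are illustrative, not tuned.
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
net.fit(X_train, y_train)

test_accuracy = net.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```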

You should also know that there are ensemble methods.
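For illustration, the sketch below compares three ensemble styles available in `sklearn.ensemble` on synthetic data: bagging of trees (random forest), boosting (gradient boosting), and soft voting over different model types. The specific estimators and settings are my own picks, not a recommendation for your data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data set.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
boosting = GradientBoostingClassifier(random_state=0)
# Soft voting averages the predicted class probabilities of its members.
voting = VotingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000))),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)

ensembles = {"random_forest": forest,
             "gradient_boosting": boosting,
             "soft_voting": voting}

# 5-fold cross-validated accuracy for each ensemble.
ensemble_scores = {name: cross_val_score(model, X, y, cv=5).mean()
                   for name, model in ensembles.items()}
for name, score in ensemble_scores.items():
    print(f"{name}: mean accuracy = {score:.3f}")
```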

A nice cheat sheet on what to use is in the sklearn tutorial you already found:

(source: scikit-learn.org)

Just try it and compare the different results. Without more information it is not possible to give you better advice.