
Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:

  1. Neural Networks
  2. Logistic Regression
  3. Naive Bayes
  4. Random Forest
  5. AdaBoost

I have read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used. It is more like a preprocessing technique.

My question is: is it best practice to compute feature importance separately for each algorithm, or to just use Information Gain? If the former, what techniques are used for each?

First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.

Some approaches that spring to mind:

  1. Mutual-information-based feature selection (eg here ), independent of the classifier.
  2. Backward or forward selection (see stackexchange question ), applicable to any classifier but potentially costly, since you need to train/test many models.
  3. Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net . The latter can work better on datasets with high collinearity.
  4. Principal component analysis, or any other dimensionality-reduction technique that groups your features ( example ).
  5. Some models compute latent variables which you can use for interpretation instead of the original features (eg Partial Least Squares or Canonical Correlation Analysis ).
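As a minimal sketch of option 1 combined with the train-only rule above (synthetic data stands in for the asker's 30-feature dataset; names and sizes are illustrative assumptions):

```python
# Mutual-information feature selection fitted on the training split only,
# then applied unchanged to the test split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

selector = SelectKBest(mutual_info_classif, k=10)
selector.fit(X_train, y_train)            # training data only!
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)   # same features on the test set

# Features ranked by estimated mutual information with the outcome.
ranking = np.argsort(selector.scores_)[::-1]
print(ranking[:10])
```

The same fit-on-train / transform-both pattern applies to any of the selection methods above.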

Specific classifiers can aid interpretability by providing extra information about the features/predictors. Off the top of my head:

  • Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (eg p-value < 0.05). (The same holds for two-class Linear Discriminant Analysis.)
  • Random Forest: can return a variable importance index that ranks the variables from most to least important.
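For the Random Forest case, the importance index mentioned above is exposed directly by scikit-learn (again on made-up data matching the question's 30 features):

```python
# Rank features with Random Forest's built-in impurity-based importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Importances are normalised to sum to 1; sort from most to least important.
ranking = np.argsort(rf.feature_importances_)[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: importance {rf.feature_importances_[idx]:.3f}")
```

For logistic regression p-values, note that scikit-learn's `LogisticRegression` does not report them; a statistics-oriented package (eg statsmodels) is the usual route.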

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.

This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions. So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
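The nonlinearity point can be seen on a toy example (the data-generating rule below is invented purely to illustrate it): a feature that matters only through its square looks useless to a linear model but dominates a forest's importance ranking.

```python
# A feature that acts nonlinearly (through its square) is invisible to
# logistic regression but picked up by a Random Forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
# x0 matters via its square (symmetric, so no linear effect); x1 is linear.
y = ((X[:, 0] ** 2 + X[:, 1]) > 1).astype(int)

logit = LogisticRegression().fit(X, y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

print(np.abs(logit.coef_[0]))   # coefficient for x0 is near zero
print(rf.feature_importances_)  # x0 carries substantial importance
```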

Since your purpose is to get some intuition on what's going on, here is what you can do:

Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model. Good in the sense that you are satisfied with its performance, and it should be robust, meaning that you should use a validation and/or a test set. These points are very important because we will analyse how the model makes its decisions, so if the model is bad you will get bad intuitions.

After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (ie how much a feature influences the classifier's prediction) that can be used for both purposes.
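SHAP itself requires the third-party `shap` package; as a dependency-light sketch of the same idea at the dataset level, scikit-learn's built-in permutation importance measures how much shuffling each feature degrades held-out performance. (SHAP additionally gives per-prediction attributions, which this does not.)

```python
# Dataset-level feature attribution via permutation importance:
# shuffle each feature on held-out data and measure the score drop.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100,
                               random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te,
                                n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
for idx in top:
    print(f"feature {idx}: mean score drop "
          f"{result.importances_mean[idx]:.3f}")
```

Unlike impurity-based importances, this is computed on held-out data, which fits the "robust model" advice above.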

For detailed instructions about this process and more tools, you can look at fast.ai's excellent machine learning course series, where lessons 2/3/4/5 cover this subject.

Hope it helps!
