Machine Learning - Feature Ranking by Algorithms
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
I have read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used. It works like a preprocessing technique.
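As a rough illustration of why this kind of score is model-independent: information gain for a feature is closely related to the mutual information between that feature and the class label, and it can be computed without fitting any classifier. The sketch below uses scikit-learn's `mutual_info_classif` on a synthetic 30-feature dataset; the data and all variable names are assumptions made for the example, not the asker's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for a 30-feature dataset (assumption, not the real data).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0,
    random_state=0,
)

# Mutual information between each feature and the label.
# No classifier is trained at any point.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Feature indices sorted from most to least informative.
ranking = np.argsort(mi_scores)[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: MI = {mi_scores[idx]:.3f}")
```

Because the score looks at one feature at a time, it cannot see features that only matter in combination with others.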
My question is: is it best practice to compute feature importance for each algorithm separately, or to just use Information Gain? If it should be done per algorithm, which technique is used for each one?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
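A minimal sketch of that train-only rule, assuming scikit-learn's `SelectKBest` with a mutual information score. The synthetic data and the choice of `k=10` are arbitrary assumptions for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fix the random seed of the scorer so the selection is reproducible.
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=10)

# The selector sees the training data only...
X_train_sel = selector.fit_transform(X_train, y_train)
# ...and the test set is merely reduced to the same columns.
X_test_sel = selector.transform(X_test)

selected = selector.get_support(indices=True)
print("selected feature indices:", selected)
```

Calling `fit` on the full dataset (or on the test set) would leak label information from the test data into the selection step.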
Some approaches that spring to mind:
Specific classifiers can aid interpretability by providing extra information about the features/predictors. Off the top of my head:
"I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome."
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (e.g. using mutual information). One reason is that Random Forests and neural networks can pick up nonlinear relationships, while logistic regression on the raw features cannot. Furthermore, Naive Bayes is blind to interactions. So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
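To make the "different models, different answers" point concrete, here is a small sketch under stated assumptions: a synthetic dataset where feature 0 affects the label only through its square, and scikit-learn's `permutation_importance` as a model-agnostic way to score features. Permutation importance is a technique swapped in for the demonstration; it is not something the question mentions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): feature 0 acts nonlinearly, feature 1 linearly,
# features 2 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] ** 2 + X[:, 1] > 1.0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

importances = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Drop in validation accuracy when one feature column is shuffled.
    result = permutation_importance(
        model, X_val, y_val, n_repeats=10, random_state=0
    )
    importances[name] = result.importances_mean
    print(name, np.round(result.importances_mean, 3))
```

In this setup the random forest should assign clear importance to feature 0, while the logistic regression should treat it as close to irrelevant, even though both models were trained on the same data.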
Since your purpose is to get some intuition about what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model. Good in the sense that you are satisfied with its performance and that it is robust, meaning that you should evaluate it on a validation and/or a test set. These points are very important because we will analyse how the model makes its decisions, so if the model is bad you will get bad intuitions.
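One way this first step could look, as a sketch only: cross-validate a Random Forest on the training portion, then confirm the score on a held-out test set before interpreting anything. The synthetic data, the 80/20 split, and the hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# 5-fold cross-validated accuracy on the training portion.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Final fit, then a single check on data the model has never seen.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"held-out accuracy: {test_score:.3f}")
```

If the cross-validated and held-out scores disagree noticeably, the model is probably not stable enough for its feature attributions to be trusted.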
After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of the classifier) and can be used for both purposes.
For detailed instructions about this process and more tools, you can look at fast.ai's excellent machine learning course series, where lessons 2/3/4/5 cover this subject.
Hope it helps!