
Feature Selection in Machine Learning

I am trying to predict y, a column of 0s and 1s (classification), using features (X). I'm using ML models like XGBoost.

One of my features, in reality, is highly predictive; let's call it X1. X1 is a column of -1/0/1 values. When X1 = 1, y = 1 80% of the time. When X1 = -1, y = 0 80% of the time. When X1 = 0, it has no correlation with y.

So in reality, ML aside, any sane person would select this feature for their model, because whenever you see X1 = 1 or X1 = -1 you have an 80% chance of predicting whether y is 0 or 1.

However, X1 is -1 or 1 only about 5% of the time, and is 0 the other 95% of the time. When I run it through feature selection techniques like Sequential Feature Selection, it doesn't get chosen, and I can understand why: 95% of the time it is 0 (and thus uncorrelated with y), so under any score I've come across, models with X1 don't score well.
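To make the setup concrete, here is a minimal stdlib-only simulation of the data-generating process described above (the exact probabilities are taken from the description; everything else is illustrative):

```python
import random

random.seed(0)

# Simulate the setup: X1 is nonzero ~5% of the time, and when it is
# nonzero it agrees with y 80% of the time; when X1 = 0, y is a coin flip.
n = 100_000
rows = []
for _ in range(n):
    x1 = random.choices([-1, 0, 1], weights=[0.025, 0.95, 0.025])[0]
    if x1 == 0:
        y = random.randint(0, 1)  # no correlation with y
    else:
        informative = random.random() < 0.8
        # 80% of the time y follows X1's sign, 20% it is flipped
        y = (1 if x1 == 1 else 0) if informative else (0 if x1 == 1 else 1)
    rows.append((x1, y))

# Unconditionally, the rule "predict y = 1 iff X1 = 1" barely beats chance...
overall = sum(y == (1 if x1 == 1 else 0) for x1, y in rows) / n

# ...but conditional on X1 != 0, the same rule is ~80% accurate.
nonzero = [(x1, y) for x1, y in rows if x1 != 0]
conditional = sum(y == (1 if x1 == 1 else 0) for x1, y in nonzero) / len(nonzero)

print(f"overall: {overall:.3f}, conditional on X1 != 0: {conditional:.3f}")
```

This is exactly the gap the question is about: any score computed over the whole sample sees roughly coin-flip performance, while the conditional accuracy on the rare nonzero rows is high.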

So my question is, more generically: how can one deal with this paradox between ML technique and real-life logic? What can I do differently in ML feature selection/modelling to take advantage of the information embedded in X1's -1s and 1s, which I know (in reality) are highly predictive? What feature selection technique would have spotted the predictive power of X1 if we didn't know anything about it in advance? So far, all the methods I know of require predictive power to be unconditional. Here, instead, X1 is highly predictive conditional on not being 0 (which is only 5% of the time). What methods are out there to capture this?

Many thanks for any insight!

Probably sklearn.feature_selection.RFE would be a good option, since it is not really dependent on a separate feature-scoring method. What I mean by that is that it recursively fits the estimator you're planning to use on smaller and smaller subsets of features, removing the features with the lowest importance scores at each step until the desired number of features is reached.

This seems like a good approach, since regardless of whether the feature in question looks like a good predictor to you, this feature selection method tells you how important the feature is to the model itself. So if a feature is not selected, it is not that relevant to the model in question.
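A minimal sketch of this with scikit-learn, on synthetic data mimicking the X1 setup from the question (the data-generating numbers and the choice of LogisticRegression as the wrapped estimator are illustrative assumptions, not from the question):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Column 0 mimics X1: nonzero 5% of the time, agreeing with y 80% of
# the time when nonzero. Columns 1 and 2 are pure noise.
x1 = rng.choice([-1, 0, 1], size=n, p=[0.025, 0.95, 0.025])
y = rng.integers(0, 2, size=n)
mask = x1 != 0
informative = rng.random(mask.sum()) < 0.8
y[mask] = np.where(informative,
                   (x1[mask] == 1).astype(int),    # follow X1's sign
                   (x1[mask] == -1).astype(int))   # flipped 20% of the time
X = np.column_stack([x1, rng.normal(size=n), rng.normal(size=n)])

# RFE repeatedly fits the estimator and drops the feature with the
# smallest coefficient magnitude until n_features_to_select remain.
selector = RFE(LogisticRegression(), n_features_to_select=1)
selector.fit(X, y)
print(selector.support_)  # X1 (column 0) should survive the elimination
```

Because RFE ranks features by the fitted model's own coefficients rather than by a marginal score over the whole sample, the large coefficient the model assigns to X1's rare nonzero values is enough to keep it.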

