
Feature Selection in Machine Learning

I am trying to predict y, a column of 0s and 1s (classification), using features (X). I'm using ML models like XGBoost.

One of my features, in reality, is highly predictive; let's call it X1. X1 is a column of -1/0/1 values. When X1 = 1, y = 1 80% of the time. When X1 = -1, y = 0 80% of the time. When X1 = 0, it has no correlation with y.

So in reality, ML aside, any sane person would select this feature for their model, because whenever you see X1 = 1 or X1 = -1 you have an 80% chance of predicting whether y is 0 or 1.

However, X1 is -1 or 1 only about 5% of the time, and is 0 the other 95% of the time. When I run it through feature selection techniques like Sequential Feature Selection, it doesn't get chosen, and I can understand why: 95% of the time it is 0 (and thus uncorrelated with y), so under any score I've come across, models with X1 don't score well.
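To make the setup concrete, here is a minimal stdlib-only simulation of the data-generating process described above (the exact probabilities are taken from the description; everything else is illustrative):

```python
import random

random.seed(0)

# Simulate the setup: X1 is nonzero ~5% of the time, and when it is
# nonzero it agrees with y 80% of the time; when X1 = 0, y is a coin flip.
n = 100_000
rows = []
for _ in range(n):
    x1 = random.choices([-1, 0, 1], weights=[0.025, 0.95, 0.025])[0]
    if x1 == 0:
        y = random.randint(0, 1)  # no correlation with y
    else:
        informative = random.random() < 0.8
        # 80% of the time y follows X1's sign, 20% it is flipped
        y = (1 if x1 == 1 else 0) if informative else (0 if x1 == 1 else 1)
    rows.append((x1, y))

# Unconditionally, the rule "predict y = 1 iff X1 = 1" barely beats chance...
overall = sum(y == (1 if x1 == 1 else 0) for x1, y in rows) / n

# ...but conditional on X1 != 0, the same rule is ~80% accurate.
nonzero = [(x1, y) for x1, y in rows if x1 != 0]
conditional = sum(y == (1 if x1 == 1 else 0) for x1, y in nonzero) / len(nonzero)

print(f"overall: {overall:.3f}, conditional on X1 != 0: {conditional:.3f}")
```

This is exactly the gap the question is about: any score computed over the whole sample sees roughly coin-flip performance, while the conditional accuracy on the rare nonzero rows is high.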

So my question is, more generically: how can one deal with this paradox between ML technique and real-life logic? What can I do differently in ML feature selection/modelling to take advantage of the information embedded in X1's -1s and 1s, which I know (in reality) are highly predictive? What feature selection technique would have spotted the predictive power of X1 if we didn't know anything about it in advance? So far, all the methods I know of require predictive power to be unconditional. Here, instead, X1 is highly predictive conditional on not being 0 (which is only 5% of the time). What methods are out there to capture this?

Many thanks for any insight!

Probably sklearn.feature_selection.RFE would be a good option, since it is not really dependent on a separate feature-scoring method. What I mean by that is that it recursively fits the estimator you're planning to use on smaller and smaller subsets of features, removing the features with the lowest importance scores at each step until the desired number of features is reached.

This seems like a good approach, since regardless of whether the feature in question looks like a good predictor to you, this feature selection method tells you how important the feature is to the model itself. So if a feature is not selected, it is not that relevant to the model in question.
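A minimal sketch of this with scikit-learn, on synthetic data mimicking the X1 setup from the question (the data-generating numbers and the choice of LogisticRegression as the wrapped estimator are illustrative assumptions, not from the question):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Column 0 mimics X1: nonzero 5% of the time, agreeing with y 80% of
# the time when nonzero. Columns 1 and 2 are pure noise.
x1 = rng.choice([-1, 0, 1], size=n, p=[0.025, 0.95, 0.025])
y = rng.integers(0, 2, size=n)
mask = x1 != 0
informative = rng.random(mask.sum()) < 0.8
y[mask] = np.where(informative,
                   (x1[mask] == 1).astype(int),    # follow X1's sign
                   (x1[mask] == -1).astype(int))   # flipped 20% of the time
X = np.column_stack([x1, rng.normal(size=n), rng.normal(size=n)])

# RFE repeatedly fits the estimator and drops the feature with the
# smallest coefficient magnitude until n_features_to_select remain.
selector = RFE(LogisticRegression(), n_features_to_select=1)
selector.fit(X, y)
print(selector.support_)  # X1 (column 0) should survive the elimination
```

Because RFE ranks features by the fitted model's own coefficients rather than by a marginal score over the whole sample, the large coefficient the model assigns to X1's rare nonzero values is enough to keep it.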

