f_regression from sklearn.feature_selection
I found the f_regression technique for feature selection in the sklearn feature selection module. I was not able to understand the principle it uses. The description given was:
Univariate linear regression tests.
Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in 3 steps:
I am not able to understand this; could someone please explain it in layman's terms?
The language in the docs is a little obtuse. I believe 'data' refers to the response. First, the chosen regressor and the response are orthogonalized with respect to the rest of the regressors. This reduces any multicollinearity that may be present. Then, the correlation between the chosen regressor and the response is calculated. In a univariate setting, the correlation coefficient is the square root of R^2, which can be written in terms of the F-statistic used to test the overall significance of a model (see also https://stats.stackexchange.com/questions/56881/whats-the-relationship-between-r2-and-f-test). So next, the correlation is converted to an F-statistic, the corresponding p-value is calculated, and F and p are returned. If there is more than one regressor, this is done for each regressor one at a time.
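To make the correlation-to-F conversion concrete, here is a small sketch that reproduces `f_regression` by hand on synthetic data (the array shapes, noise level, and random seed are arbitrary choices for illustration): for each feature, compute its Pearson correlation with the response, then convert it with F = r^2 / (1 - r^2) * (n - 2) and get the p-value from the F(1, n - 2) distribution.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
n_samples, n_features = 100, 3
X = rng.randn(n_samples, n_features)
y = X[:, 0] + 0.5 * rng.randn(n_samples)  # response driven mainly by feature 0

# sklearn's univariate F-test, applied to each regressor one at a time
F_sklearn, p_sklearn = f_regression(X, y)

# Manual version: Pearson correlation -> F-statistic -> p-value
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
dof = n_samples - 2                       # degrees of freedom for a single regressor
F_manual = r**2 / (1 - r**2) * dof        # univariate F from the correlation
p_manual = stats.f.sf(F_manual, 1, dof)   # upper tail of the F(1, dof) distribution

print(np.allclose(F_sklearn, F_manual))   # True
print(np.allclose(p_sklearn, p_manual))   # True
```

The match confirms the answer's point: in the univariate setting the F-test is just a monotone transformation of the squared correlation, so ranking features by F is the same as ranking them by |r|.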