Retain feature names after Scikit Feature Selection

After running a Variance Threshold from Scikit-Learn on a set of data, it removes a couple of features. I feel I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    selector = pd.DataFrame(selector.transform(data))
    return selector

x = VarianceThreshold_selector(data)
print(x)

changes the following data (this is just a small subset of the rows):

Survived    Pclass  Sex Age SibSp   Parch   Nonsense
0             3      1  22   1        0        0
1             1      2  38   1        0        0
1             3      2  26   0        0        0

into this (again, just a small subset of the rows):

     0         1      2     3
0    3      22.0      1     0
1    1      38.0      1     0
2    3      26.0      0     0

Using the get_support method, I know that these are Pclass, Age, Sibsp, and Parch, so I'd rather this return something more like:

     Pclass         Age      Sibsp     Parch
0        3          22.0         1         0
1        1          38.0         1         0
2        3          26.0         0         0

Is there an easy way to do this? I'm very new to Scikit-Learn, so I'm probably just doing something silly.

Would something like this help? If you pass it a pandas DataFrame, it will get the columns and use get_support, as you mentioned, to iterate over the column list by index and pull out only the column headers that met the variance threshold.

>>> df
   Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
0         0       3    1   22      1      0         0
1         1       1    2   38      1      0         0
2         1       3    2   26      0      0         0

>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

>>> variance_threshold_selector(df, 0.5)
   Pclass  Age
0       3   22
1       1   38
2       3   26
>>> variance_threshold_selector(df, 0.9)
   Age
0   22
1   38
2   26
>>> variance_threshold_selector(df, 0.1)
   Survived  Pclass  Sex  Age  SibSp
0         0       3    1   22      1
1         1       1    2   38      1
2         1       3    2   26      0

I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.

However, you can subset the data a bit more cleanly like this:

data_transformed = data.loc[:, selector.get_support()]
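
To make that concrete, here is a small self-contained sketch using the sample rows from the question (the Survived values are taken from the frame shown in the earlier answer):

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({
    "Survived": [0, 1, 1], "Pclass": [3, 1, 3], "Sex": [1, 2, 2],
    "Age": [22, 38, 26], "SibSp": [1, 1, 0], "Parch": [0, 0, 0],
    "Nonsense": [0, 0, 0],
})

selector = VarianceThreshold(0.5)
selector.fit(data)

# get_support() returns a boolean mask over the columns, so .loc keeps
# both the selected values and their original column names.
data_transformed = data.loc[:, selector.get_support()]
print(data_transformed.columns.tolist())  # ['Pclass', 'Age']

As an aside, if I remember right, recent scikit-learn releases (1.2 and later) can also return a DataFrame with the surviving column names directly if you call selector.set_output(transform="pandas") before fitting.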

There are probably better ways to do this, but for those interested, here's how I did it:

from sklearn.feature_selection import VarianceThreshold
import pandas as pd

def VarianceThreshold_selector(data):

    # Select model
    selector = VarianceThreshold(0)  # 0 is also the default, i.e. only remove features with the same value in all samples

    # Fit the model
    selector.fit(data)
    features = selector.get_support(indices=True)  # array of integer indices of the retained features
    features = data.columns[features]              # names of the retained features

    # Format and return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector

As I had some problems with the function by Jarad, I have mixed it with the solution by pteehan, which I found to be more reliable. I also added NA replacement as standard, since VarianceThreshold does not like NA values.

from sklearn.feature_selection import VarianceThreshold

def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
    df1 = df.copy(deep=True)  # Make a deep copy of the dataframe
    selector = VarianceThreshold(thresh)
    selector.fit(df1.fillna(na_replacement))  # Fill NA values, as VarianceThreshold cannot handle those
    df2 = df.loc[:, selector.get_support(indices=False)]  # Subset the original dataframe to the columns that passed the threshold
    return df2
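
A quick sanity check of the NA handling (a toy frame made up just for illustration, using the function defined above):

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "constant": [1, 1, 1],           # zero variance -> dropped
    "varied":   [1.0, np.nan, 3.0],  # has an NA but enough variance to survive
})

print(variance_threshold_select(toy))
#    varied
# 0     1.0
# 1     NaN
# 2     3.0

One thing to keep in mind is that the fill value itself goes into the variance computation, so columns with many NAs can look far more variable than they really are.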

How about this as code?

import statistics

low_var_cols = []
for col in df.columns:
    if statistics.variance(df[col]) <= 0.1:
        low_var_cols.append(col)

then drop the columns from the dataframe?
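
One way to finish that off as a helper (the name drop_low_variance is just made up here; also note that statistics.variance computes the sample variance with an n-1 denominator, so the cutoff is not exactly the same as scikit-learn's VarianceThreshold, which uses the population variance):

import statistics

def drop_low_variance(df, threshold=0.1):
    # Collect the columns whose sample variance is at or below the cutoff...
    low_var_cols = [col for col in df.columns
                    if statistics.variance(df[col].tolist()) <= threshold]
    # ...and drop them from the frame.
    return df.drop(columns=low_var_cols)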

You can also do the thresholding with pandas:

data_new = data.loc[:, data.std(axis=0) > 0.75]
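
Just note that this thresholds on the standard deviation rather than the variance. A variance-based version (closer to VarianceThreshold, which uses the population variance, hence ddof=0 here) would look something like:

data_new = data.loc[:, data.var(axis=0, ddof=0) > 0.5]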
