Retain feature names after Scikit Feature Selection

After running a Variance Threshold from Scikit-Learn on a set of data, it removes a couple of features. I feel I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    selector = pd.DataFrame(selector.transform(data))
    return selector

x = VarianceThreshold_selector(data)
print(x)

changes the following data (this is just a small subset of the rows):

Survived    Pclass  Sex Age SibSp   Parch   Nonsense
0             3      1  22   1        0        0
1             1      2  38   1        0        0
1             3      2  26   0        0        0

into this (again, just a small subset of the rows):

     0         1      2     3
0    3      22.0      1     0
1    1      38.0      1     0
2    3      26.0      0     0

Using the get_support method, I know that these are Pclass, Age, Sibsp, and Parch, so I'd rather this return something more like:

     Pclass         Age      Sibsp     Parch
0        3          22.0         1         0
1        1          38.0         1         0
2        3          26.0         0         0

Is there an easy way to do this? I'm very new to Scikit-Learn, so I'm probably just doing something silly.

Would something like this help? If you pass it a pandas DataFrame, it will get the columns and use get_support, as you mentioned, to iterate over the column list by index and pull out only the column headers that met the variance threshold.

>>> df
   Survived  Pclass  Sex  Age  SibSp  Parch  Nonsense
0         0       3    1   22      1      0         0
1         1       1    2   38      1      0         0
2         1       3    2   26      0      0         0

>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]

>>> variance_threshold_selector(df, 0.5)
   Pclass  Age
0       3   22
1       1   38
2       3   26
>>> variance_threshold_selector(df, 0.9)
   Age
0   22
1   38
2   26
>>> variance_threshold_selector(df, 0.1)
   Survived  Pclass  Sex  Age  SibSp
0         0       3    1   22      1
1         1       1    2   38      1
2         1       3    2   26      0

I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.

However, you can subset the data a bit more cleanly like this:

data_transformed = data.loc[:, selector.get_support()]
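
To make that concrete, here is a small self-contained sketch using the sample rows from the question (the Survived values are taken from the frame shown in the earlier answer):

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({
    "Survived": [0, 1, 1], "Pclass": [3, 1, 3], "Sex": [1, 2, 2],
    "Age": [22, 38, 26], "SibSp": [1, 1, 0], "Parch": [0, 0, 0],
    "Nonsense": [0, 0, 0],
})

selector = VarianceThreshold(0.5)
selector.fit(data)

# get_support() returns a boolean mask over the columns, so .loc keeps
# both the selected values and their original column names.
data_transformed = data.loc[:, selector.get_support()]
print(data_transformed.columns.tolist())  # ['Pclass', 'Age']

As an aside, if I remember right, recent scikit-learn releases (1.2 and later) can also return a DataFrame with the surviving column names directly if you call selector.set_output(transform="pandas") before fitting.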

There are probably better ways to do this, but for those interested, here's how I did it:

from sklearn.feature_selection import VarianceThreshold
import pandas as pd

def VarianceThreshold_selector(data):

    # Select model
    selector = VarianceThreshold(0)  # 0 is also the default, i.e. only remove features with the same value in all samples

    # Fit the model
    selector.fit(data)
    features = selector.get_support(indices=True)  # array of integer indices of the retained features
    features = data.columns[features]              # names of the retained features

    # Format and return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector

As I had some problems with the function by Jarad, I have mixed it with the solution by pteehan, which I found to be more reliable. I also added NA replacement as standard, since VarianceThreshold does not like NA values.

from sklearn.feature_selection import VarianceThreshold

def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
    df1 = df.copy(deep=True)  # Make a deep copy of the dataframe
    selector = VarianceThreshold(thresh)
    selector.fit(df1.fillna(na_replacement))  # Fill NA values, as VarianceThreshold cannot handle those
    df2 = df.loc[:, selector.get_support(indices=False)]  # Subset the original dataframe to the columns that passed the threshold
    return df2
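
A quick sanity check of the NA handling (a toy frame made up just for illustration, using the function defined above):

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "constant": [1, 1, 1],           # zero variance -> dropped
    "varied":   [1.0, np.nan, 3.0],  # has an NA but enough variance to survive
})

print(variance_threshold_select(toy))
#    varied
# 0     1.0
# 1     NaN
# 2     3.0

One thing to keep in mind is that the fill value itself goes into the variance computation, so columns with many NAs can look far more variable than they really are.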

How about this as code?

import statistics

low_var_cols = []
for col in df.columns:
    if statistics.variance(df[col]) <= 0.1:
        low_var_cols.append(col)

then drop the columns from the dataframe?
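
One way to finish that off as a helper (the name drop_low_variance is just made up here; also note that statistics.variance computes the sample variance with an n-1 denominator, so the cutoff is not exactly the same as scikit-learn's VarianceThreshold, which uses the population variance):

import statistics

def drop_low_variance(df, threshold=0.1):
    # Collect the columns whose sample variance is at or below the cutoff...
    low_var_cols = [col for col in df.columns
                    if statistics.variance(df[col].tolist()) <= threshold]
    # ...and drop them from the frame.
    return df.drop(columns=low_var_cols)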

You can also do the thresholding with pandas:

data_new = data.loc[:, data.std(axis=0) > 0.75]
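
Just note that this thresholds on the standard deviation rather than the variance. A variance-based version (closer to VarianceThreshold, which uses the population variance, hence ddof=0 here) would look something like:

data_new = data.loc[:, data.var(axis=0, ddof=0) > 0.5]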
