Scikit-Learn feature selection using SVC based on percentile of .coef_ values

I am trying to write a Python class that uses the .coef_ attribute values to select features in scikit-learn 0.17.1. I only want to select features whose .coef_ values lie at or above the 10th percentile (10th, 11th, 12th, ..., 98th, 99th, 100th).

I have not been able to do this with SelectFromModel(), so I have tried to write a custom class named ChooseCoefPercentile() for this feature selection. I am trying to use the following function to select the features according to the percentile of .coef_:

from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(load_iris().data,
                                   load_iris().target, test_size=0.33, random_state=42)

def percentile_sep(coefs,p):
    from numpy import percentile as pc
    gt_p = coefs[coefs>pc(coefs,p)].argsort()
    return list(gt_p)

from sklearn.base import BaseEstimator, TransformerMixin
class ChooseCoefPercentile(BaseEstimator, TransformerMixin):
    def __init__(self, est_, perc=50):
        self.perc = perc
        self.est_ = est_
    def fit(self, *args, **kwargs):
        self.est_.fit(*args, **kwargs)
        return self
    def transform(self, X):
        perc_i = percentile_sep(self.est_.coef_,self.perc)
        i_ = self.est_.coef_.argsort()[::-1][perc_i[:]]
        X_tr = X[:,i_]
        self.coef_ = self.est_.coef_[i_]
        return X_tr

# Import modules
from sklearn import svm,ensemble,pipeline,grid_search

# Instantiate feature selection estimator and classifier
f_sel = ChooseCoefPercentile(svm.SVC(kernel='linear'),perc=10)
clf = ensemble.RandomForestClassifier(random_state=42,oob_score=False)

CustPipe = pipeline.Pipeline([("feat_s",f_sel),("Clf",clf)])
bf_est = grid_search.GridSearchCV(CustPipe,cv=2,param_grid={'Clf__n_estimators':[100,200]})
bf_est.fit(X_train, y_train)

I am getting the following error:

Traceback (most recent call last):
  File "C:\Python27\test.py", line 35, in <module>
    bf_est.fit(X_train, y_train)
  File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 804, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "C:\Python27\lib\site-packages\sklearn\grid_search.py", line 553, in _fit
    for parameters in parameter_iterable
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 800, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 658, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 566, in _dispatch
    job = ImmediateComputeBatch(batch)
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 180, in __init__
    self.results = batch()
  File "C:\Python27\lib\site-packages\sklearn\externals\joblib\parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 164, in fit
    Xt, fit_params = self._pre_transform(X, y, **fit_params)
  File "C:\Python27\lib\site-packages\sklearn\pipeline.py", line 145, in _pre_transform
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 458, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "C:\Python27\test.py", line 21, in transform
    i_ = self.est_.coef_.argsort()[::-1][perc_i[:]]
IndexError: index 6 is out of bounds for axis 0 with size 3

It seems there is a problem with the NumPy array of .coef_ values in the following line:

i_ = self.est_.coef_.argsort()[::-1][perc_i[:]]

In this line, I am trying to choose only those .coef_ values that lie above the 10th percentile, based on their index. The indices are stored in a list, perc_i. I cannot seem to use this list to index the .coef_ array correctly.

Is this error occurring because the array needs to be divided into rows? Or should I use some other method to extract the .coef_ values based on the percentiles?
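
For context, the shape mismatch behind the IndexError can probably be reproduced with NumPy alone. The sketch below assumes that a linear SVC fitted on a 3-class, 4-feature problem exposes a .coef_ matrix of shape (3, 4); the coefficient values are made up for illustration and are not taken from a fitted model.

```python
import numpy as np

# Made-up stand-in for SVC(kernel='linear').coef_ on a 3-class, 4-feature
# problem: one row per pair of classes, one column per feature.
coef = np.array([[-0.05, 0.52, -1.00, -0.46],
                 [-0.01, 0.18, -0.54, -0.29],
                 [0.60, 1.20, -2.00, -1.90]])


def percentile_sep(coefs, p):
    # Same logic as the helper in the question: boolean masking flattens
    # the matrix, so argsort returns positions within the 1-D filtered
    # array, not (row, column) positions in coefs.
    return list(coefs[coefs > np.percentile(coefs, p)].argsort())


perc_i = percentile_sep(coef, 10)
order = coef.argsort()[::-1]  # argsort on a 2-D array stays 2-D

print("coef shape:", coef.shape, "argsort shape:", order.shape)
print("number of positions:", len(perc_i), "largest position:", max(perc_i))

# Indexing with a list of integers selects along axis 0 (the rows), so any
# position at or beyond the row count raises IndexError.
try:
    order[perc_i]
    index_failed = False
except IndexError as exc:
    index_failed = True
    print("IndexError:", exc)
```

The positions in perc_i refer to the flattened, filtered array, while the indexing step treats them as row numbers of a 2-D array, which is why an index larger than the number of rows is rejected.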

I would suggest computing the relevant columns of the coefficient matrix using modular arithmetic based on the number of rows:

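def transform(self, X):
    # Positions returned by percentile_sep for coefficients above the percentile
    perc_i = percentile_sep(self.est_.coef_, self.perc)
    # Number of rows in the coefficient matrix
    nclass = self.est_.coef_.shape[0]
    # Reduce each position modulo the row count and drop duplicates
    i_ = list(set(map(lambda x: x % nclass, perc_i)))
    X_tr = X[:, i_]
    self.coef_ = self.est_.coef_[i_]
    return X_tr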
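
A different way to get one index per feature, offered only as a hedged sketch and not as the method above, is to collapse the coefficient matrix to a single score per column before taking the percentile. The helper name select_by_coef_percentile is hypothetical, and scoring a feature by its largest absolute coefficient across rows is an assumption; the question itself compares raw .coef_ values. The coefficient values are made up for illustration.

```python
import numpy as np


def select_by_coef_percentile(coef, perc):
    """Return column indices of features whose score is at or above the
    perc-th percentile of all feature scores (hypothetical helper)."""
    coef = np.atleast_2d(coef)
    # Assumption: score each feature by its largest absolute weight
    # across the rows of the coefficient matrix.
    scores = np.abs(coef).max(axis=0)
    threshold = np.percentile(scores, perc)
    # flatnonzero returns the column indices in ascending order.
    return np.flatnonzero(scores >= threshold)


# Made-up (3, 4) coefficient matrix: 3 rows, 4 features.
coef = np.array([[-0.05, 0.52, -1.00, -0.46],
                 [-0.01, 0.18, -0.54, -0.29],
                 [0.60, 1.20, -2.00, -1.90]])

keep = select_by_coef_percentile(coef, 10)

# Apply the selection to a toy data matrix with 5 samples and 4 features.
X = np.arange(20, dtype=float).reshape(5, 4)
X_tr = X[:, keep]

print("kept columns:", keep)
print("reduced shape:", X_tr.shape)
```

Inside transform, the same idea would amount to computing the indices from self.est_.coef_ and self.perc and returning X[:, i_]. Because the indices always refer to columns, they stay within bounds regardless of how many rows .coef_ has.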
