
How to get importance of categorical feature after using DictVectorizer in sklearn

I'm using sklearn.ensemble.GradientBoostingRegressor to train a model.

My data set includes heterogeneous variables, both numeric and categorical. Since sklearn does not support categorical variables directly, I use DictVectorizer to convert these categorical features before feeding them into the regressor. Here is a piece of my code:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction import DictVectorizer

# process numeric and categorical variables separately
lsNumericColumns = []
lsCategoricalColumns = []
for col in dfTrainingSet.columns.drop('Y'):  # exclude the target column
    if dfTrainingSet[col].dtype == object:   # np.object was removed in NumPy 1.24
        lsCategoricalColumns.append(col)
    else:
        lsNumericColumns.append(col)

# numeric columns
dfNumVariables = dfTrainingSet.loc[:, lsNumericColumns]
dfNumVariables = dfNumVariables.fillna(0)
arrNumVariables = dfNumVariables.to_numpy()  # as_matrix() was removed in pandas 1.0

# categorical columns
dfCateVariables = dfTrainingSet.loc[:, lsCategoricalColumns]
dfCateVariables = dfCateVariables.fillna('NA')
vectorizer = DictVectorizer(sparse=False)
arrCateFeatures = vectorizer.fit_transform(dfCateVariables.to_dict(orient='records'))

# setup training set
arrX = np.concatenate((arrNumVariables, arrCateFeatures), axis=1)
arrY = dfTrainingSet['Y'].values

Then I train the model and output the feature importances:

# setup regressor
# note: loss='lad' and max_features='auto' were renamed/removed in newer
# sklearn; 'absolute_error' is the current name for the LAD loss
params = {'n_estimators': 500, 'max_depth': 10, 'min_samples_split': 50,
          'min_samples_leaf': 50, 'learning_rate': 0.05,
          'loss': 'absolute_error', 'subsample': 1.0}
gbr = GradientBoostingRegressor(**params)

# fit
print('start to train model ...') 
gbr.fit(arrX, arrY) 
print('finish training model.')

print(gbr.feature_importances_)

This gives me the importance of each feature, indexed by column of the transformed matrix. However, this index is not the original feature index, since one categorical column can be converted into several columns.

I understand I can get the vectorized feature names from DictVectorizer, but how can I find out the importance of the original features?

Could I just sum up the importances of all vectorized features that correspond to the same original feature to get the importance of that original feature?

You can get the feature importances for the one-hot features with

zip(vectorizer.get_feature_names_out(), gbr.feature_importances_)

(In older sklearn versions the method was called get_feature_names.) This gives a list of (feature, importance) pairs, where features have the form 'name=value' for categoricals and are just the name for originally numeric features. The order of appearance in the get_feature_names_out output is guaranteed to match the column order in the transform or fit_transform output.

To be honest, I'm not sure about the feature importances for the original categoricals; I'd try taking the mean rather than the sum.
