
How to get importance of categorical feature after using DictVectorizer in sklearn

I'm using sklearn.ensemble.GradientBoostingRegressor to train a model.

My data set includes heterogeneous variables, both numeric and categorical. Since sklearn does not support categorical variables directly, I use DictVectorizer to convert these categorical features before feeding them into the regressor. Here is a piece of my code:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction import DictVectorizer

# process numeric and categorical variables separately
lsNumericColumns = []
lsCategoricalColumns = []
for col in dfTrainingSet.columns.drop('Y'):  # exclude the target column
    if dfTrainingSet[col].dtype == object:   # np.object was removed in NumPy 1.24
        lsCategoricalColumns.append(col)
    else:
        lsNumericColumns.append(col)

# numeric columns
dfNumVariables = dfTrainingSet.loc[:, lsNumericColumns]
dfNumVariables = dfNumVariables.fillna(0)
arrNumVariables = dfNumVariables.to_numpy()  # as_matrix() was removed in pandas 1.0

# categorical columns
dfCateVariables = dfTrainingSet.loc[:, lsCategoricalColumns]
dfCateVariables = dfCateVariables.fillna('NA')
vectorizer = DictVectorizer(sparse=False)
arrCateFeatures = vectorizer.fit_transform(dfCateVariables.to_dict(orient='records'))

# setup training set
arrX = np.concatenate((arrNumVariables, arrCateFeatures), axis=1)
arrY = dfTrainingSet['Y'].values

Then I train the model and output the feature importances:

# setup regressor
# note: loss='lad' and max_features='auto' were renamed/removed in newer
# sklearn; 'absolute_error' is the current name for the LAD loss
params = {'n_estimators': 500, 'max_depth': 10, 'min_samples_split': 50,
          'min_samples_leaf': 50, 'learning_rate': 0.05,
          'loss': 'absolute_error', 'subsample': 1.0}
gbr = GradientBoostingRegressor(**params)

# fit
print('start to train model ...') 
gbr.fit(arrX, arrY) 
print('finish training model.')

print(gbr.feature_importances_)

This gives me the importance of each feature, indexed by column of the transformed matrix. However, this index is not the original feature index, since one categorical column can be converted into several columns.

I understand I can get the vectorized feature names from DictVectorizer, but how can I find out the importance of the original features?

Could I just sum up the importances of all vectorized features that correspond to the same original feature to get the importance of that original feature?

You can get the feature importances for the one-hot features with

zip(vectorizer.get_feature_names_out(), gbr.feature_importances_)

(In older sklearn versions the method was called get_feature_names.) This gives a list of (feature, importance) pairs, where features have the form 'name=value' for categoricals and are just the name for originally numeric features. The order of appearance in the get_feature_names_out output is guaranteed to match the column order in the transform or fit_transform output.

To be honest, I'm not sure about the feature importances for the original categoricals; I'd try taking the mean rather than the sum.
