
Feature importance using lightgbm

I am trying to run my lightgbm for feature selection as below:

Initialization

# Imports used by the snippets below
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='goss',
                           n_estimators=10000,
                           class_weight='balanced')

Then I fit the model as below:

# Fit the model twice to avoid overfitting
for i in range(2):

   # Split into training and validation set
   train_features, valid_features, train_y, valid_y = train_test_split(train_X, train_Y, test_size = 0.25, random_state = i)

   # Train using early stopping
   model.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)], 
             eval_metric = 'auc', verbose = 200)

   # Record the feature importances
   feature_importances += model.feature_importances_

but I get the below error:

Training until validation scores don't improve for 100 rounds. 
Early stopping, best iteration is: [6]  valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,) 
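
The mismatched shapes in the traceback suggest features_sample has 87 columns while train_X has 83. A minimal sketch of one likely fix, assuming that is the cause: size the accumulator from the frame that is actually passed to fit() (names as in the question):

import numpy as np

# Match the accumulator's length to model.feature_importances_,
# which has one entry per column of the matrix used for fitting
feature_importances = np.zeros(train_X.shape[1])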

An example of getting feature importance in lightgbm when the model was trained with lgb.train:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def plotImp(model, X , num = 20, fig_size = (40, 20)):
    feature_imp = pd.DataFrame({'Value':model.feature_importance(),'Feature':X.columns})
    plt.figure(figsize=fig_size)
    sns.set(font_scale = 5)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", 
                                                        ascending=False)[0:num])
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances-01.png')
    plt.show()
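
A minimal usage sketch, assuming a Booster trained with lgb.train() on a pandas DataFrame (x_train and y_train are placeholder names):

import lightgbm as lgb

# Train a Booster directly, so the model exposes feature_importance()
model = lgb.train({'objective': 'binary'}, lgb.Dataset(x_train, label=y_train))

# Plot the top 20 features by split count
plotImp(model, x_train, num=20)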

Depending on whether we trained the model using scikit-learn or lightgbm methods, to get importance we should choose respectively the feature_importances_ property or the feature_importance() function, as in this example (where model is a result of lgbm.fit() / lgbm.train(), and train_columns = x_train_df.columns):

import pandas as pd

def get_lgbm_varimp(model, train_columns, max_vars=50):
    
    if "basic.Booster" in str(model.__class__):
        # lightgbm.basic.Booster was trained directly, so using feature_importance() function 
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # Scikit-learn API LGBMClassifier or LGBMRegressor was fitted, 
        # so using feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T

    cv_varimp_df.columns = ['feature_name', 'varimp']

    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)

    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]   

    return cv_varimp_df
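
A hedged usage sketch covering both branches (clf, booster, x_train_df, and y_train are placeholder names):

import lightgbm as lgb

# Scikit-learn API: takes the feature_importances_ branch
clf = lgb.LGBMClassifier(n_estimators=50).fit(x_train_df, y_train)
print(get_lgbm_varimp(clf, x_train_df.columns, max_vars=10))

# Native API: str(model.__class__) contains "basic.Booster",
# so the feature_importance() branch is taken
booster = lgb.train({'objective': 'binary'}, lgb.Dataset(x_train_df, label=y_train))
print(get_lgbm_varimp(booster, x_train_df.columns, max_vars=10))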
    

Note that we rely on the assumption that feature importance values are ordered just like the model matrix columns were ordered during training (incl. one-hot dummy cols); see LightGBM #209.

For LightGBM version 3.1.1, extending the comment of @user3067175:

pd.DataFrame({'Value':model.feature_importance(),'Feature':features}).sort_values(by="Value",ascending=False)

features is a list of feature names, in the same order as in your dataset; it can be replaced by features = df_train.columns.tolist(). This should return the feature importances in the same order as the plot.

Note: if you use LGBMRegressor, you should use

pd.DataFrame({'Value':model.feature_importances_,'Feature':features}).sort_values(by="Value",ascending=False)
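
As a follow-up sketch: feature_importance() defaults to split counts, and the native API also accepts importance_type='gain' if you prefer importance weighted by loss reduction (df_train and model as above):

import pandas as pd

features = df_train.columns.tolist()

# 'gain' sums the loss reduction from each split instead of counting splits
gain_imp = pd.DataFrame({
    'Value': model.feature_importance(importance_type='gain'),
    'Feature': features,
}).sort_values(by='Value', ascending=False)
print(gain_imp.head(10))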

If you want to examine a loaded model for which you don't have the training data, you can get the feature importances and feature names by

df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)
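
A sketch of the loading step this assumes; the file path is a placeholder for a model previously saved with save_model():

import lightgbm as lgb

# Load a Booster from disk; feature_name() and feature_importance()
# then work without access to the original training data
model = lgb.Booster(model_file='model.txt')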
