
Feature importance using lightgbm

I am trying to run my lightgbm for feature selection as below:

Initialization

# Imports used by the snippets below
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='goss',
                           n_estimators=10000,
                           class_weight='balanced')

Then I fit the model as below:

# Fit the model twice to avoid overfitting
for i in range(2):

   # Split into training and validation set
   train_features, valid_features, train_y, valid_y = train_test_split(train_X, train_Y, test_size = 0.25, random_state = i)

   # Train using early stopping
   model.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)], 
             eval_metric = 'auc', verbose = 200)

   # Record the feature importances
   feature_importances += model.feature_importances_

but I get the below error:

Training until validation scores don't improve for 100 rounds. 
Early stopping, best iteration is: [6]  valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,) 
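
The mismatched shapes in the traceback suggest features_sample has 87 columns while train_X has 83. A minimal sketch of one likely fix, assuming that is the cause: size the accumulator from the frame that is actually passed to fit() (names as in the question):

import numpy as np

# Match the accumulator's length to model.feature_importances_,
# which has one entry per column of the matrix used for fitting
feature_importances = np.zeros(train_X.shape[1])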

An example of getting feature importance in lightgbm when the model was trained with lgb.train:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def plotImp(model, X , num = 20, fig_size = (40, 20)):
    feature_imp = pd.DataFrame({'Value':model.feature_importance(),'Feature':X.columns})
    plt.figure(figsize=fig_size)
    sns.set(font_scale = 5)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", 
                                                        ascending=False)[0:num])
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances-01.png')
    plt.show()
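
A minimal usage sketch, assuming a Booster trained with lgb.train() on a pandas DataFrame (x_train and y_train are placeholder names):

import lightgbm as lgb

# Train a Booster directly, so the model exposes feature_importance()
model = lgb.train({'objective': 'binary'}, lgb.Dataset(x_train, label=y_train))

# Plot the top 20 features by split count
plotImp(model, x_train, num=20)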

Depending on whether we trained the model using scikit-learn or lightgbm methods, to get importance we should choose respectively the feature_importances_ property or the feature_importance() function, as in this example (where model is a result of lgbm.fit() / lgbm.train(), and train_columns = x_train_df.columns):

import pandas as pd

def get_lgbm_varimp(model, train_columns, max_vars=50):
    
    if "basic.Booster" in str(model.__class__):
        # lightgbm.basic.Booster was trained directly, so using feature_importance() function 
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # Scikit-learn API LGBMClassifier or LGBMRegressor was fitted, 
        # so using feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T

    cv_varimp_df.columns = ['feature_name', 'varimp']

    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)

    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]   

    return cv_varimp_df
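
A hedged usage sketch covering both branches (clf, booster, x_train_df, and y_train are placeholder names):

import lightgbm as lgb

# Scikit-learn API: takes the feature_importances_ branch
clf = lgb.LGBMClassifier(n_estimators=50).fit(x_train_df, y_train)
print(get_lgbm_varimp(clf, x_train_df.columns, max_vars=10))

# Native API: str(model.__class__) contains "basic.Booster",
# so the feature_importance() branch is taken
booster = lgb.train({'objective': 'binary'}, lgb.Dataset(x_train_df, label=y_train))
print(get_lgbm_varimp(booster, x_train_df.columns, max_vars=10))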
    

Note that we rely on the assumption that feature importance values are ordered just like the model matrix columns were ordered during training (incl. one-hot dummy cols); see LightGBM #209.

For LightGBM version 3.1.1, extending the comment of @user3067175:

pd.DataFrame({'Value':model.feature_importance(),'Feature':features}).sort_values(by="Value",ascending=False)

features is a list of feature names, in the same order as in your dataset; it can be replaced by features = df_train.columns.tolist(). This should return the feature importances in the same order as the plot.

Note: if you use LGBMRegressor, you should use

pd.DataFrame({'Value':model.feature_importances_,'Feature':features}).sort_values(by="Value",ascending=False)
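
As a follow-up sketch: feature_importance() defaults to split counts, and the native API also accepts importance_type='gain' if you prefer importance weighted by loss reduction (df_train and model as above):

import pandas as pd

features = df_train.columns.tolist()

# 'gain' sums the loss reduction from each split instead of counting splits
gain_imp = pd.DataFrame({
    'Value': model.feature_importance(importance_type='gain'),
    'Feature': features,
}).sort_values(by='Value', ascending=False)
print(gain_imp.head(10))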

If you want to examine a loaded model for which you don't have the training data, you can get the feature importances and feature names by

df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)
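
A sketch of the loading step this assumes; the file path is a placeholder for a model previously saved with save_model():

import lightgbm as lgb

# Load a Booster from disk; feature_name() and feature_importance()
# then work without access to the original training data
model = lgb.Booster(model_file='model.txt')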
