Feature importance using lightgbm
I am trying to run my lightgbm model for feature selection, as below.
Initialization:
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])
# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='goss',
                           n_estimators=10000, class_weight='balanced')
Then I fit the model as below:
# Fit the model twice to avoid overfitting
for i in range(2):
    # Split into training and validation set
    train_features, valid_features, train_y, valid_y = train_test_split(train_X, train_Y, test_size = 0.25, random_state = i)
    # Train using early stopping
    model.fit(train_features, train_y, early_stopping_rounds=100, eval_set = [(valid_features, valid_y)],
              eval_metric = 'auc', verbose = 200)
    # Record the feature importances
    feature_importances += model.feature_importances_
But I get the error below:
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is: [6] valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,)
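The shapes in the error message suggest that the accumulator was sized from features_sample (87 columns) while the model was fitted on train_X (83 columns), so the in-place addition cannot broadcast. Below is a minimal sketch reproducing this with stand-in NumPy arrays; the column counts are taken from the error message, and model_importances is only a placeholder for model.feature_importances_, not real model output.

```python
import numpy as np

n_sample_cols = 87  # assumed width of features_sample (first shape in the error)
n_train_cols = 83   # assumed width of train_X (second shape in the error)

# Accumulator sized from the wrong frame, as in the question
feature_importances = np.zeros(n_sample_cols)
# Placeholder for model.feature_importances_ (one value per training column)
model_importances = np.ones(n_train_cols)

error_message = None
try:
    feature_importances += model_importances
except ValueError as exc:
    error_message = str(exc)

# Sizing the accumulator from the training matrix avoids the mismatch
fixed_importances = np.zeros(n_train_cols)
for _ in range(2):
    fixed_importances += model_importances
fixed_importances /= 2  # average over the two fits
```

In the question's code this would correspond to initializing with np.zeros(train_X.shape[1]), assuming train_X is the frame actually passed to train_test_split.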
An example of getting feature importance in lightgbm when using a model returned by train:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

def plotImp(model, X, num=20, fig_size=(40, 20)):
    # model is a Booster returned by train; X is the training DataFrame
    feature_imp = pd.DataFrame({'Value': model.feature_importance(), 'Feature': X.columns})
    plt.figure(figsize=fig_size)
    sns.set(font_scale=5)
    # Plot only the num most important features
    sns.barplot(x="Value", y="Feature",
                data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances-01.png')
    plt.show()
Depending on whether we trained the model using the scikit-learn API or the native lightgbm API, we should use the feature_importances_ property or the feature_importance() function, respectively, as in this example (where model is the result of lgbm.fit() / lgbm.train(), and train_columns = x_train_df.columns):
import pandas as pd

def get_lgbm_varimp(model, train_columns, max_vars=50):
    if "basic.Booster" in str(model.__class__):
        # lightgbm.basic.Booster was trained directly, so using feature_importance() function
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # Scikit-learn API LGBMClassifier or LGBMRegressor was fitted,
        # so using feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T
    cv_varimp_df.columns = ['feature_name', 'varimp']
    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)
    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]
    return cv_varimp_df
Note that we rely on the assumption that the feature importance values are ordered the same way the model matrix columns were ordered during training (including one-hot dummy columns); see LightGBM #209.
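The ordering assumption itself cannot be verified from the two sequences alone, but a length mismatch (as in the question's error) can be caught before the values are paired. A minimal sketch, where pair_importances is a hypothetical helper name and not part of lightgbm:

```python
def pair_importances(feature_names, importances):
    """Pair names with importances, sorted by importance (descending).

    Only the lengths are checked; the column order is still assumed to
    match the order used during training.
    """
    names = list(feature_names)
    values = list(importances)
    if len(names) != len(values):
        raise ValueError(
            f"{len(names)} feature names but {len(values)} importance values"
        )
    return sorted(zip(names, values), key=lambda pair: pair[1], reverse=True)


# Illustrative values only, not real model output
ranked = pair_importances(["age", "income", "city"], [1, 5, 3])
```

Passing train_columns together with either model.feature_importance() or model.feature_importances_ would fit the same pattern.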
For LightGBM version 3.1.1, extending the comment of @user3067175:
pd.DataFrame({'Value':model.feature_importance(),'Feature':features}).sort_values(by="Value",ascending=False)
Here features is a list of feature names in the same order as the columns of your dataset; it can be replaced by features = df_train.columns.tolist(). This should return the feature importances in the same order as the plot.
Note: If you use LGBMRegressor, you should use
pd.DataFrame({'Value':model.feature_importances_,'Feature':features}).sort_values(by="Value",ascending=False)
If you want to examine a loaded model for which you don't have the training data, you can get the feature importances and feature names with:
df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)
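The same pattern can be wrapped in a small function that also exposes the importance_type argument of feature_importance() ('split' counts how often a feature is used, 'gain' sums the gain of its splits). This is a sketch under assumptions: booster_importance_frame is a hypothetical helper name, and the file name in the comment is only a placeholder.

```python
import pandas as pd


def booster_importance_frame(booster, importance_type="split"):
    """Build a sorted importance table from a lightgbm Booster-like object.

    The object only needs feature_name() and feature_importance(), so the
    training data is not required.
    """
    frame = pd.DataFrame({
        "feature": booster.feature_name(),
        "importance": booster.feature_importance(importance_type=importance_type),
    })
    return frame.sort_values("importance", ascending=False).reset_index(drop=True)


# Possible usage with a saved model (placeholder file name):
#   import lightgbm as lgb
#   booster = lgb.Booster(model_file="model.txt")
#   print(booster_importance_frame(booster, importance_type="gain"))
```

For a fitted scikit-learn wrapper such as LGBMClassifier, the underlying Booster should be reachable through its booster_ attribute.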