Feature importance using lightgbm
I am trying to run my LightGBM for feature selection as below.

Initialization:

import numpy as np
import lightgbm as lgb

# Initialize an empty array to hold feature importances
feature_importances = np.zeros(features_sample.shape[1])

# Create the model with several hyperparameters
model = lgb.LGBMClassifier(objective='binary',
                           boosting_type='goss',
                           n_estimators=10000,
                           class_weight='balanced')
I then fit the model as follows:

from sklearn.model_selection import train_test_split

# Fit the model twice to avoid overfitting
for i in range(2):
    # Split into training and validation sets
    train_features, valid_features, train_y, valid_y = train_test_split(
        train_X, train_Y, test_size=0.25, random_state=i)

    # Train using early stopping
    model.fit(train_features, train_y,
              early_stopping_rounds=100,
              eval_set=[(valid_features, valid_y)],
              eval_metric='auc', verbose=200)

    # Record the feature importances
    feature_importances += model.feature_importances_
But I get the following error:
Training until validation scores don't improve for 100 rounds.
Early stopping, best iteration is: [6] valid_0's auc: 0.88648
ValueError: operands could not be broadcast together with shapes (87,) (83,) (87,)
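The shapes in the traceback suggest the accumulator was sized from `features_sample` (87 columns) while the model was fitted on `train_X`, which apparently has only 83 columns, so the two arrays cannot be added. A minimal NumPy sketch of the mismatch and of a fix (sizing the accumulator from the matrix actually passed to `fit`; the sizes 87 and 83 are taken from the error message, the data itself is made up):

```python
import numpy as np

# Accumulator sized from a different matrix than the one the model was fit on
feature_importances = np.zeros(87)    # features_sample.shape[1] == 87
importances_from_model = np.ones(83)  # model was fitted on 83 columns

try:
    feature_importances += importances_from_model
except ValueError as e:
    print(e)  # operands could not be broadcast together ...

# Fix: size the accumulator from the matrix actually used for training
train_X = np.random.rand(10, 83)
feature_importances = np.zeros(train_X.shape[1])
feature_importances += importances_from_model  # shapes now match
```

With matching shapes, the per-fit importances accumulate element-wise as intended.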
An example of getting feature importance in LightGBM when the model is trained with lightgbm's train method:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

def plotImp(model, X, num=20, fig_size=(40, 20)):
    feature_imp = pd.DataFrame({'Value': model.feature_importance(), 'Feature': X.columns})
    plt.figure(figsize=fig_size)
    sns.set(font_scale=5)
    sns.barplot(x="Value", y="Feature",
                data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    plt.title('LightGBM Features (avg over folds)')
    plt.tight_layout()
    plt.savefig('lgbm_importances-01.png')
    plt.show()
Depending on whether we trained the model with the scikit-learn API or with lightgbm's native API, to get the importances we should use the feature_importances_ property or the feature_importance() function, respectively, as in this example (where model is the result of lgbm.fit() / lgbm.train() and train_columns = x_train_df.columns):
import pandas as pd

def get_lgbm_varimp(model, train_columns, max_vars=50):
    if "basic.Booster" in str(model.__class__):
        # lightgbm.basic.Booster was trained directly,
        # so use the feature_importance() function
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        # Scikit-learn API LGBMClassifier or LGBMRegressor was fitted,
        # so use the feature_importances_ property
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T

    cv_varimp_df.columns = ['feature_name', 'varimp']
    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)
    cv_varimp_df = cv_varimp_df.iloc[0:max_vars]
    return cv_varimp_df
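A quick way to exercise this helper without training anything is a stand-in object that exposes a `feature_importances_` attribute, mimicking a fitted scikit-learn-API model (the class and values below are invented for illustration; the function is repeated so the snippet is self-contained):

```python
import pandas as pd

class FakeSklearnModel:
    """Stand-in for a fitted LGBMClassifier/LGBMRegressor (illustrative only)."""
    feature_importances_ = [10, 30, 20]

def get_lgbm_varimp(model, train_columns, max_vars=50):
    if "basic.Booster" in str(model.__class__):
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importance()]).T
    else:
        cv_varimp_df = pd.DataFrame([train_columns, model.feature_importances_]).T
    cv_varimp_df.columns = ['feature_name', 'varimp']
    cv_varimp_df.sort_values(by='varimp', ascending=False, inplace=True)
    return cv_varimp_df.iloc[0:max_vars]

print(get_lgbm_varimp(FakeSklearnModel(), ['f1', 'f2', 'f3']))
# Rows come back sorted by importance: f2 (30), f3 (20), f1 (10)
```

Since the fake class name does not contain "basic.Booster", the else branch is taken, which is exactly the path a real LGBMClassifier would follow.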
Note that we rely on the assumption that the feature importance values are ordered the same way as the model matrix columns were during training (including one-hot dummy columns); see LightGBM #209.
For LightGBM version 3.1.1, extending @user3067175's comment:

pd.DataFrame({'Value': model.feature_importance(), 'Feature': features}).sort_values(by="Value", ascending=False)

where features is a list of feature names in the same order as your dataset; it can be replaced with features = df_train.columns.tolist(). This should return the feature importances in the same order as the plot.
Note: if you are using LGBMRegressor, you should use

pd.DataFrame({'Value': model.feature_importances_, 'Feature': features}).sort_values(by="Value", ascending=False)
If you want to examine a loaded model without its training data, you can get the feature importances and feature names as follows:
df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)
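The same pattern can be checked without loading a real model by faking the two Booster accessors (feature_name() and feature_importance() are the actual Booster method names; the class and values below are invented for illustration):

```python
import pandas as pd

class FakeBooster:
    """Stand-in for a lightgbm.Booster loaded from file (illustrative only)."""
    def feature_name(self):
        return ['age', 'income', 'zip']
    def feature_importance(self):
        return [5, 42, 0]

model = FakeBooster()
df_feature_importance = (
    pd.DataFrame({
        'feature': model.feature_name(),
        'importance': model.feature_importance(),
    })
    .sort_values('importance', ascending=False)
)
print(df_feature_importance)
# 'income' (42) comes first, then 'age' (5), then 'zip' (0)
```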