
Getting feature_importances_ after getting optimal TPOT pipeline?

I've read through a few pages, but I need someone to help explain how to make this work.

I'm using TPOTRegressor() to get an optimal pipeline, and from there I would love to be able to plot the .feature_importances_ of the pipeline it returns:

best_model = TPOTRegressor(cv=folds, generations=2, population_size=10, verbosity=2, random_state=seed)  # memory='./PipelineCache', memory='auto',
best_model.fit(X_train, Y_train)
feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_

I saw this kind of setup in a now-closed issue on GitHub, but currently I get the error:

Best pipeline: LassoLarsCV(input_matrix, normalize=True)

Traceback (most recent call last):
  File "main2.py", line 313, in <module>
    feature_importance = best_model.fitted_pipeline_.steps[-1][1].feature_importances_
AttributeError: 'LassoLarsCV' object has no attribute 'feature_importances_'

So how would I get these feature importances from the optimal pipeline, regardless of which model it lands on? Is this even possible? Or does someone have a better way of trying to plot feature importances from a TPOT run?

Thanks!

UPDATE

For clarification, what I mean by feature importance is determining how important each feature (the X's) of your dataset is to predicting the (Y) label, and using a bar chart to plot each feature's level of importance in coming up with its predictions. TPOT doesn't do this directly (I don't think), so I was thinking I'd grab the pipeline it came up with, re-run it on the training data, and then somehow use .feature_importances_ to graph the feature importances, since these are all sklearn regressors I'm using?

Very nice question.

You just need to fit the best model again in order to get the feature importances.

best_model.fit(X_train, Y_train)
exctracted_best_model = best_model.fitted_pipeline_.steps[-1][1]

The last line returns the best model based on the CV.
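For context, fitted_pipeline_ is an ordinary sklearn Pipeline, so a quick sketch like this (assuming best_model has already been fit) shows where the final estimator sits:

# .steps is a list of (name, transformer_or_estimator) tuples,
# so steps[-1][1] is simply the final estimator in the pipeline.
for name, step in best_model.fitted_pipeline_.steps:
    print(name, type(step).__name__)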

You can then use:

exctracted_best_model.fit(X_train, Y_train) 

to train it. If the best model has the desired attribute, then you will be able to access it after calling exctracted_best_model.fit(X_train, Y_train).
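For example, a minimal sketch that guards the access with hasattr, since not every sklearn regressor exposes this attribute:

# Guard the attribute access: e.g. LassoLarsCV (from the question's
# traceback) has no feature_importances_ attribute.
exctracted_best_model.fit(X_train, Y_train)
if hasattr(exctracted_best_model, 'feature_importances_'):
    feature_importance = exctracted_best_model.feature_importances_
else:
    print(type(exctracted_best_model).__name__, 'has no feature_importances_')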


More details (in my comments) and a toy example:

from tpot import TPOTRegressor
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)
# reduce the training set for time's sake
X_train = X_train[:100,:] 
y_train = y_train[:100]

# Build the TPOT regressor
tpot = TPOTRegressor(cv=2, generations=5, population_size=50, verbosity=2)

# Fit the pipeline
tpot.fit(X_train, y_train)

# Get the best model
exctracted_best_model = tpot.fitted_pipeline_.steps[-1][1]

print(exctracted_best_model)
# AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square',
#          n_estimators=100, random_state=None)

# Train the `exctracted_best_model` using THE WHOLE DATASET.
# You need to use the whole dataset in order to get feature importances for all
# the features in your dataset.
X, y = digits.data, digits.target  # the full dataset, not the 100-row slice
exctracted_best_model.fit(X, y)  # X, y IMPORTANT

# Access its feature importances
exctracted_best_model.feature_importances_

# Plot them using a barplot.
# Because we fitted on the whole dataset (X, y) above, we get importances
# for all the features in the dataset, not just those in `X_train`.
positions = range(exctracted_best_model.feature_importances_.shape[0])
plt.bar(positions, exctracted_best_model.feature_importances_)
plt.show()

IMPORTANT NOTE: In the above example, the best model based on the pipeline was AdaBoostRegressor(base_estimator=None, learning_rate=0.5, loss='square'). This model indeed has the attribute feature_importances_. In the case where the best model does not have a feature_importances_ attribute, the exact same code will not work. You will need to read the docs and check the attributes of each returned best model. E.g. if the best model were LassoCV, then you would use the coef_ attribute.
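A sketch of one way to handle both cases generically (assuming the final estimator is either tree-based or linear; note that coefficients only behave like importances when the features share a comparable scale):

import numpy as np

# Fall back to |coef_| for linear models such as LassoCV or the
# LassoLarsCV from the question's traceback.
model = exctracted_best_model
if hasattr(model, 'feature_importances_'):
    scores = model.feature_importances_
elif hasattr(model, 'coef_'):
    scores = np.abs(model.coef_)
else:
    raise AttributeError('no feature_importances_ or coef_ on this model')

plt.bar(range(len(scores)), scores)
plt.show()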

Output:

[Bar plot of the feature importances]
