
H2O sklearn wrapper: how to get the H2OAutoML object out of it and run explain()?

I am using the h2o AutoML library from Python with its scikit-learn wrapper to create a pipeline for training my model. I follow this example, recommended by the official documentation:

from sklearn import datasets
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

from h2o.sklearn import H2OAutoMLClassifier


# The question does not show how X_classes / y_classes were created;
# any classification dataset works, e.g. (assumption for a runnable example):
X_classes, y_classes = datasets.load_breast_cancer(return_X_y=True, as_frame=True)

X_classes_train, X_classes_test, y_classes_train, y_classes_test = train_test_split(X_classes, y_classes, test_size=0.33)

pipeline = Pipeline([
    ('polyfeat', PolynomialFeatures(degree=2)),
    ('featselect', SelectKBest(f_classif, k=5)),
    ('classifier', H2OAutoMLClassifier(max_models=10, seed=2022, sort_metric='logloss'))
])

pipeline.fit(X_classes_train, y_classes_train)
preds = pipeline.predict(X_classes_test)

So I've trained my pipeline/model; now I want to get the H2OAutoML object out of the H2OAutoMLClassifier wrapper, so I can call its .explain() method and get some insight into the features and models.

How do I do that?

There's no easy way to use .explain() on an sklearn pipeline. You can, however, extract the H2OAutoML leader model (the best model trained by the AutoML) and call .explain() on that.

For .explain() to work, you'll need an H2OFrame with the same features that were used to train the model, and that is a problem for both interpretability and ease of use. You will need to create that dataset by applying the first two steps of the pipeline (polyfeat and featselect in your example). This alone makes the output hard to interpret: the columns get generic names like C1, C2, ...

You can do what I described with the following code:

import h2o

# The fitted H2OAutoML instance lives on the last pipeline step's wrapper
automl = pipeline.steps[-1][1].estimator
leader = automl.leader
response_column = leader.actual_params["response_column"]

# Transform the test data through every step except the final classifier
transformed_df = X_classes_test
for name, step in pipeline.steps[:-1]:
    transformed_df = step.transform(transformed_df)

# Create the H2OFrame and restore the column names the leader was trained on
h2o_frame = h2o.H2OFrame(transformed_df)
h2o_frame.columns = [c for c in leader._model_json["output"]["names"]
                     if c != response_column]

# Add the response column
h2o_frame = h2o_frame.cbind(h2o.H2OFrame(y_classes_test.to_frame()))
h2o_frame.set_name(h2o_frame.shape[1] - 1, response_column)

# Run .explain() on the leader model
leader.explain(h2o_frame)

However, if you need interpretability and do not need to cross-validate the whole pipeline, I'd recommend another approach. Use the first N-1 steps of the pipeline to create a data frame, add meaningful column names to the newly created data frame, and then run AutoML directly through the h2o API. This makes .explain() and the other interpretability-related methods easier to use, and you get column names with actual meaning rather than names based on column order.
