
H2O variable importance for all discrete levels included in the model

I want to extract variable importance for the individual categorical levels included in a given model. There are several categorical predictors in the dataset supplied below, yet when I calculate feature importance, only the "whole column's" importance is shown, rather than the importances being broken up into something like C1_level0: importance and C1_level1: importance . How can I view the importance of the columns similar to what I'd see if I manually one-hot-encoded these discrete levels?
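For context, here is a minimal sketch of what "manually one-hot-encoded" means, using pandas rather than H2O and a toy column whose name and levels are purely illustrative: encoding produces one 0/1 indicator column per discrete level, and a model trained on those columns would report an importance for each.

```python
import pandas as pd

# Toy frame with one categorical predictor (names are illustrative,
# not taken from the H2O example below).
df = pd.DataFrame({"C1": ["a", "b", "a", "c"]})

# Manual one-hot encoding: one indicator column per discrete level.
dummies = pd.get_dummies(df["C1"], prefix="C1")
print(list(dummies.columns))  # → ['C1_a', 'C1_b', 'C1_c']
```

A model trained on `C1_a`, `C1_b`, and `C1_c` separately would yield the per-level importances the question asks about; H2O's tree algorithms instead handle the categorical column natively and report a single importance for `C1`.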

>>> import h2o
>>> h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321. connected.
--------------------------  ----------------------------------------
H2O cluster uptime:         48 mins 24 secs
H2O cluster timezone:       America/Chicago
H2O data parsing timezone:  UTC
H2O cluster version:        3.20.0.5
H2O cluster version age:    6 days
H2O cluster name:           H2O_from_python_user_9znggm
H2O cluster total nodes:    1
H2O cluster free memory:    1.464 Gb
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster status:         locked, healthy
H2O connection url:         http://localhost:54321
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4
Python version:             3.6.5 final
--------------------------  ----------------------------------------
>>>
>>> df = h2o.create_frame(categorical_fraction=0.5)
Create Frame progress: |██████████████████████████████████████████████████████████████████████| 100%
>>>
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> model = H2OGradientBoostingEstimator()
>>> model.train(x=[c for c in df.columns if c != 'C1'], y='C1', training_frame=df)
gbm Model Build progress: |███████████████████████████████████████████████████████████████████| 100%
>>>
>>> model.varimp(True)
  variable  relative_importance  scaled_importance  percentage
0       C3          4448.583984           1.000000    0.255125
1       C9          4424.002930           0.994474    0.253715
2       C6          4273.684082           0.960684    0.245094
3       C4          4249.320312           0.955207    0.243697
4      C10            12.800615           0.002877    0.000734
5       C7            12.022744           0.002703    0.000689
6       C8             8.271964           0.001859    0.000474
7       C2             4.649746           0.001045    0.000267
8       C5             3.567022           0.000802    0.000205

This is something you could get from H2O's GLM by using model.std_coef_plot() ; however, the expected behavior of model.varimp(True) is to give you each feature's importance as a whole, not the importance of its individual levels.
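Conversely, if you did have per-level importances (say, from a model trained on manually one-hot-encoded indicator columns), collapsing them back into the whole-column numbers that model.varimp() reports is just a sum over each column's levels. A minimal sketch with invented column names and importance values:

```python
from collections import defaultdict

# Hypothetical per-level importances, as a model trained on manually
# one-hot-encoded columns might report them (names and numbers invented).
level_importance = {"C3_a": 30, "C3_b": 10, "C9_x": 25, "C9_y": 5}

# Collapse back to whole-column importances, the granularity at which
# model.varimp() reports categorical predictors.
column_importance = defaultdict(float)
for name, importance in level_importance.items():
    column, _, _level = name.rpartition("_")
    column_importance[column] += importance

print(dict(column_importance))  # → {'C3': 40.0, 'C9': 30.0}
```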

If you want to understand the relationship between an individual level and the outcome, I would recommend using H2O's partial dependence plots (see the H2O documentation).

What you want is called a partial dependence plot, and you can get the underlying data from:

pdp_data = model.partial_plot(data=fi_data, cols=variable_list, plot=False, nbins=30, plot_stddev=False)

This data table contains the information you need, so after some processing you can produce, for each variable in the model, a graph like this.

[image: partial dependence plot, one point per categorical level]

The red point represents the mean of Y, and each dot is the prediction for that level, ceteris paribus.
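A minimal sketch of that post-processing, using invented partial-dependence values in plain Python rather than the real pdp_data (which H2O returns as a list of tables, one per column in variable_list):

```python
# Hypothetical (level, mean_response) pairs for one categorical
# variable, as they might come out of pdp_data (values invented).
pdp_rows = [("level0", 0.42), ("level1", 0.55), ("level2", 0.38)]

# Overall mean response: the "red point" in the plot above.
overall_mean = sum(response for _, response in pdp_rows) / len(pdp_rows)

# One line per level: its prediction with all other variables held
# constant (ceteris paribus), versus the overall mean.
for level, response in pdp_rows:
    delta = response - overall_mean
    print(f"{level}: {response:.2f} ({delta:+.2f} vs mean {overall_mean:.2f})")
```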
