How to get proper feature importance information when using categorical features in H2O
When I have categorical features in my dataset, H2O applies one-hot encoding and starts the training process. When I call the summary method to see the feature importances, though, it treats each encoded categorical level as a separate feature. My question is: how can I get the feature importance information for the original features?
#import libraries
import pandas as pd
import h2o
import random
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
#initiate h20
h2o.init(ip ='localhost')
h2o.remove_all()
#load a fake data
training_data = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/glm_test/gamma_dispersion_factor_9_10kRows.csv")
#Specify the predictors (x) and the response (y). I add a dummy categorical column named "0"
myPredictors = ["abs.C1.", "abs.C2.", "abs.C3.", "abs.C4.", "abs.C5.", '0']
myResponse = "resp"
#add a dummy column consisting of random string values
train = h2o.as_list(training_data)
train = pd.concat([train, pd.DataFrame(random.choices(['ya','ne','agh','c','de'], k=len(training_data)))], axis=1)
train = h2o.H2OFrame(train)
#define linear regression method
def linearRegression(df, predictors, response):
    model = H2OGeneralizedLinearEstimator(family="gaussian", lambda_=0, standardize=True)
    model.train(x=predictors, y=response, training_frame=df)
    print(model.summary)
linearRegression(train, myPredictors, myResponse)
Once I run the model, here's the summary of feature importance reported by H2O:
Variable Importances:
variable relative_importance scaled_importance percentage
0 abs.C5. 1.508031 1.000000 0.257004
1 abs.C4. 1.364653 0.904924 0.232569
2 abs.C3. 1.158184 0.768011 0.197382
3 abs.C2. 0.766653 0.508380 0.130656
4 abs.C1. 0.471997 0.312989 0.080440
5 0.de 0.275667 0.182799 0.046980
6 0.ne 0.210085 0.139311 0.035803
7 0.ya 0.078100 0.051789 0.013310
8 0.c 0.034353 0.022780 0.005855
Is there a method by which I could get the feature importance for column 0? Note that in reality I have many more categorical features; this is just a MWE.
As mentioned in the comments, there are several problems with this approach overall, the main one being that importances are a tricky matter even with standard analysis, and further complicating things is likely to produce misleading results. One-hot encoding is not too problematic with GLMs, but it could be with, say, RFs. With that out of the way...
Given a dataframe (which model.summary seemingly produces):
import pandas as pd
df = pd.DataFrame([
['abs.C5.', 1.508031, 1.000000, 0.257004],
['abs.C4.', 1.364653, 0.904924, 0.232569],
['abs.C3.', 1.158184, 0.768011, 0.197382],
['abs.C2.', 0.766653, 0.508380, 0.130656],
['abs.C1.', 0.471997, 0.312989, 0.080440],
['0.de', 0.275667, 0.182799, 0.046980],
['0.ne', 0.210085, 0.139311, 0.035803],
['0.ya', 0.078100, 0.051789, 0.013310],
['0.c', 0.034353, 0.022780, 0.005855],
], columns=['variable', 'relative_importance', 'scaled_importance', 'percentage'])
df
#    variable  relative_importance  scaled_importance  percentage
# 0   abs.C5.             1.508031           1.000000    0.257004
# 1   abs.C4.             1.364653           0.904924    0.232569
# 2   abs.C3.             1.158184           0.768011    0.197382
# 3   abs.C2.             0.766653           0.508380    0.130656
# 4   abs.C1.             0.471997           0.312989    0.080440
# 5      0.de             0.275667           0.182799    0.046980
# 6      0.ne             0.210085           0.139311    0.035803
# 7      0.ya             0.078100           0.051789    0.013310
# 8       0.c             0.034353           0.022780    0.005855
You can add an artificial column for groupby and then aggregate (sum):
df['orig'] = df['variable'].apply(lambda x: x.split('.')[0])
df.groupby('orig')['percentage'].sum()
# orig
# 0      0.101948
# abs    0.898051
# Name: percentage, dtype: float64
Or do it in one line so as not to alter the dataframe, to the same effect:
df.groupby(df['variable'].apply(lambda x: x.split('.')[0]))['percentage'].sum()
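One caveat: splitting on '.' assumes the original column names themselves contain no dots, which is not true for abs.C1. through abs.C5. here (they all collapse into a single abs group). If you want one row per original column, a more robust sketch is to match each encoded name against the list of original predictor names (the predictors list below is copied from the question; the helper name to_original is my own, not an H2O API):

```python
import pandas as pd

# Importance table as reported above (only the columns we need)
df = pd.DataFrame([
    ['abs.C5.', 0.257004], ['abs.C4.', 0.232569], ['abs.C3.', 0.197382],
    ['abs.C2.', 0.130656], ['abs.C1.', 0.080440], ['0.de', 0.046980],
    ['0.ne', 0.035803], ['0.ya', 0.013310], ['0.c', 0.005855],
], columns=['variable', 'percentage'])

# Original (pre-encoding) column names, as used in the question
predictors = ["abs.C1.", "abs.C2.", "abs.C3.", "abs.C4.", "abs.C5.", "0"]

def to_original(encoded, originals):
    """Map an encoded name like '0.de' back to its original column '0'.

    One-hot columns are named '<original>.<level>', so we pick the longest
    original name that either equals the encoded name (numeric columns pass
    through unchanged) or prefixes it followed by a dot."""
    matches = [o for o in originals
               if encoded == o or encoded.startswith(o + '.')]
    return max(matches, key=len) if matches else encoded

importances = df.groupby(
    df['variable'].map(lambda v: to_original(v, predictors))
)['percentage'].sum()
print(importances)
```

With this mapping the dummy column 0 sums to about 0.101948 while the numeric abs.C*. columns keep their individual percentages, instead of being lumped together.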