H2O Python relevel vs relevel_by_frequency for factor columns

Based on H2O's documentation, it would seem that relevel('most_frequent_category') and relevel_by_frequency() should accomplish the same thing. However, the coefficient estimates differ depending on which method is used to set the reference level for a factor column.

An open-source dataset fetched through sklearn demonstrates how the GLM coefficients are misaligned when the base level is set using the two releveling methods. Why do the coefficient estimates vary when the base level is the same in both models?

import pandas as pd
from sklearn.datasets import fetch_openml

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init(max_mem_size=8)


def load_mtpl2(n_samples=100000):
    """
    Fetch the French Motor Third-Party Liability Claims dataset.
    https://scikit-learn.org/stable/auto_examples/linear_model/plot_tweedie_regression_insurance_claims.html
    
    Parameters
    ----------
    n_samples: int, default=100000
      number of samples to select (for faster run time). Full dataset has
      678013 samples.
    """
    # freMTPL2freq dataset from https://www.openml.org/d/41214
    df_freq = fetch_openml(data_id=41214, as_frame=True)["data"]
    df_freq["IDpol"] = df_freq["IDpol"].astype(int)
    df_freq.set_index("IDpol", inplace=True)

    # freMTPL2sev dataset from https://www.openml.org/d/41215
    df_sev = fetch_openml(data_id=41215, as_frame=True)["data"]

    # sum ClaimAmount over identical IDs
    df_sev = df_sev.groupby("IDpol").sum()

    df = df_freq.join(df_sev, how="left")
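    # assign the result back; in-place fillna on a column selection is unreliable in recent pandas
    df["ClaimAmount"] = df["ClaimAmount"].fillna(0)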

    # unquote string fields
    for column_name in df.columns[df.dtypes.values == object]:
        df[column_name] = df[column_name].str.strip("'")
    return df.iloc[:n_samples]


df = load_mtpl2()
df.loc[(df["ClaimAmount"] == 0) & (df["ClaimNb"] >= 1), "ClaimNb"] = 0
df["Exposure"] = df["Exposure"].clip(upper=1)
df["ClaimAmount"] = df["ClaimAmount"].clip(upper=100000)
df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]

X_freq = h2o.H2OFrame(df)
X_freq["VehBrand"] = X_freq["VehBrand"].asfactor()
X_freq["VehBrand"] = X_freq["VehBrand"].relevel_by_frequency()

X_relevel = h2o.H2OFrame(df)
X_relevel["VehBrand"] = X_relevel["VehBrand"].asfactor()
X_relevel["VehBrand"] = X_relevel["VehBrand"].relevel("B1") # most frequent category

response_col = "PurePremium"
weight_col = "Exposure"
predictors = ["VehBrand"]  # train() documents x as a list of column names

glm_freq = H2OGeneralizedLinearEstimator(family="tweedie",
                                      solver='IRLSM',
                                      tweedie_variance_power=1.5,
                                      tweedie_link_power=0,
                                      lambda_=0,
                                      compute_p_values=True,
                                      remove_collinear_columns=True,
                                      seed=1)

glm_relevel = H2OGeneralizedLinearEstimator(family="tweedie",
                                      solver='IRLSM',
                                      tweedie_variance_power=1.5,
                                      tweedie_link_power=0,
                                      lambda_=0,
                                      compute_p_values=True,
                                      remove_collinear_columns=True,
                                      seed=1)

glm_freq.train(x=predictors, y=response_col, training_frame=X_freq, weights_column=weight_col)
glm_relevel.train(x=predictors, y=response_col, training_frame=X_relevel, weights_column=weight_col)

print('GLM with the reference level set using relevel_by_frequency()')
print(glm_freq._model_json['output']['coefficients_table'])
print('\n')
print('GLM with the reference level manually set using relevel()')
print(glm_relevel._model_json['output']['coefficients_table'])

Output

GLM with the reference level set using relevel_by_frequency()
Coefficients: glm coefficients
names         coefficients    std_error    z_value     p_value      standardized_coefficients
------------  --------------  -----------  ----------  -----------  ---------------------------
Intercept     5.40413         1.24082      4.35531     1.33012e-05  5.40413
VehBrand.B2   -0.398721       1.2599       -0.316472   0.751645     -0.398721
VehBrand.B12  -0.061573       1.46541      -0.0420176  0.966485     -0.061573
VehBrand.B3   -0.393908       1.30712      -0.301356   0.763144     -0.393908
VehBrand.B5   -0.282484       1.31929      -0.214118   0.830455     -0.282484
VehBrand.B6   -0.387747       1.25943      -0.307876   0.758177     -0.387747
VehBrand.B4   0.391771        1.45615      0.269047    0.787894     0.391771
VehBrand.B10  -0.0542706      1.35049      -0.040186   0.967945     -0.0542706
VehBrand.B13  -0.306381       1.4628       -0.209449   0.834098     -0.306381
VehBrand.B11  -0.435297       1.29155      -0.337035   0.736091     -0.435297
VehBrand.B14  -0.304243       1.34781      -0.225732   0.821411     -0.304243


GLM with the reference level manually set using relevel()
Coefficients: glm coefficients
names         coefficients    std_error    z_value     p_value     standardized_coefficients
------------  --------------  -----------  ----------  ----------  ---------------------------
Intercept     5.01639         0.215713     23.2549     2.635e-119  5.01639
VehBrand.B10  0.081366        0.804165     0.101181    0.919407    0.081366
VehBrand.B11  0.779518        0.792003     0.984237    0.325001    0.779518
VehBrand.B12  -0.0475497      0.41834      -0.113663   0.909505    -0.0475497
VehBrand.B13  0.326174        0.80891      0.403227    0.686782    0.326174
VehBrand.B14  0.387747        1.25943      0.307876    0.758177    0.387747
VehBrand.B2   -0.010974       0.306996     -0.0357465  0.971485    -0.010974
VehBrand.B3   -0.00616108     0.464188     -0.0132728  0.98941     -0.00616108
VehBrand.B4   0.333477        0.575082     0.579877    0.561999    0.333477
VehBrand.B5   0.105263        0.497431     0.211613    0.832409    0.105263
VehBrand.B6   0.0835042       0.568769     0.146816    0.883278    0.0835042

The two datasets are almost the same except in one place:

In the first dataset, the number of rows with VehBrand B1 is 72. In the second dataset, the number of rows with VehBrand B14 is 721.

If you compare the two datasets, you can match the level names between them by their row counts as follows:

Freq B2 == Relevel B2 with 26500 rows

Freq B12 == Relevel B13 with 1883 rows

Freq B3 == Relevel B3 with 8260 rows

Freq B5 == Relevel B5 with 6053 rows

Freq B6 == Relevel B1 with 27240 rows

Freq B4 == Relevel B11 with 1774 rows

Freq B10 == Relevel B4 with 3968 rows

Freq B13 == Relevel B10 with 2268 rows

Freq B11 == Relevel B12 with 16619 rows

Freq B14 == Relevel B6 with 4714 rows
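
The pairing above can be reproduced mechanically by matching levels on their row counts. The sketch below hard-codes the counts quoted in this answer. The dictionary and function names are illustrative only, and in practice the counts would have to be tabulated from the VehBrand column of each frame.

```python
# Row counts per VehBrand label, copied from the list above.
# These names are illustrative and not part of the H2O API.
freq_counts = {
    "B2": 26500, "B12": 1883, "B3": 8260, "B5": 6053, "B6": 27240,
    "B4": 1774, "B10": 3968, "B13": 2268, "B11": 16619, "B14": 4714,
}
relevel_counts = {
    "B2": 26500, "B13": 1883, "B3": 8260, "B5": 6053, "B1": 27240,
    "B11": 1774, "B4": 3968, "B10": 2268, "B12": 16619, "B6": 4714,
}


def match_levels_by_count(left, right):
    """Pair each label in `left` with the label in `right` that has the same row count."""
    by_count = {}
    for label, n in right.items():
        if n in by_count:
            # Two levels with the same count cannot be told apart this way.
            raise ValueError(f"ambiguous row count {n}")
        by_count[n] = label
    return {label: by_count.get(n) for label, n in left.items()}


mapping = match_levels_by_count(freq_counts, relevel_counts)

# Labels that point at a different group of rows in the two frames.
renamed = {f: r for f, r in mapping.items() if f != r}
print(renamed)
```

Matching on counts only works when every level has a distinct count, which is why the helper raises on ties.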

Since you are training the two GLM models on different datasets, you will get different coefficients and different prediction results.
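
The coefficient tables in the question are consistent with this. With an intercept and a single factor, the fitted linear predictor for a level is the intercept plus that level's coefficient (zero for the reference level), so matching groups of rows should get the same value in both fits. The sketch below checks this using the printed coefficients and the mapping listed above. Pairing the first model's reference level B1 with the second model's B14 is an assumption inferred from the matching coefficients and standard errors, not something H2O reports.

```python
# Coefficients copied from the two tables printed in the question.
freq_intercept = 5.40413
freq_coef = {
    "B1": 0.0,  # reference level (absent from the table)
    "B2": -0.398721, "B12": -0.061573, "B3": -0.393908, "B5": -0.282484,
    "B6": -0.387747, "B4": 0.391771, "B10": -0.0542706, "B13": -0.306381,
    "B11": -0.435297, "B14": -0.304243,
}
relevel_intercept = 5.01639
relevel_coef = {
    "B1": 0.0,  # reference level (absent from the table)
    "B10": 0.081366, "B11": 0.779518, "B12": -0.0475497, "B13": 0.326174,
    "B14": 0.387747, "B2": -0.010974, "B3": -0.00616108, "B4": 0.333477,
    "B5": 0.105263, "B6": 0.0835042,
}

# Label in the relevel_by_frequency() frame -> label in the relevel() frame.
# All pairs come from the answer's list except "B1" -> "B14", which is an
# assumption inferred from the coefficient tables.
freq_to_relevel = {
    "B2": "B2", "B12": "B13", "B3": "B3", "B5": "B5", "B6": "B1",
    "B4": "B11", "B10": "B4", "B13": "B10", "B11": "B12", "B14": "B6",
    "B1": "B14",
}

# Compare the linear predictor of each group under both models.
gaps = {}
for f_label, r_label in freq_to_relevel.items():
    eta_freq = freq_intercept + freq_coef[f_label]
    eta_relevel = relevel_intercept + relevel_coef[r_label]
    gaps[f_label] = abs(eta_freq - eta_relevel)

max_gap = max(gaps.values())
print(f"largest discrepancy in linear predictor: {max_gap:.2e}")
```

If the largest discrepancy is within the rounding of the printed values, the two models describe the same fitted group means and differ only in which label is attached to which rows. Under that reading, the first model's reference group is the small group that the second model calls B14, which would also be consistent with its larger standard errors.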
