h2o: F1 score and other binary classification metrics missing

Question

I am able to run the following example code and get an F1 score:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")

# convert columns to factors
airlines["Year"]= airlines["Year"].asfactor()
airlines["Month"]= airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()

# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
              "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"

# split into train and validation sets
train, valid = airlines.split_frame(ratios = [.8], seed = 1234)

# train your model
airlines_gbm = H2OGradientBoostingEstimator(sample_rate = .7, seed = 1234)
airlines_gbm.train(x = predictors,
                   y = response,
                   training_frame = train,
                   validation_frame = valid)

# retrieve the model performance
perf = airlines_gbm.model_performance(valid)
perf

with output like this:

ModelMetricsBinomial: gbm
** Reported on test data. **

MSE: 0.20546330299964743
RMSE: 0.4532806007316521
LogLoss: 0.5967028742962095
Mean Per-Class Error: 0.31720065289432364
AUC: 0.7414970113257631
AUCPR: 0.7616331690362552
Gini: 0.48299402265152613

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.35417599264806404:
          NO      YES     Error   Rate
0  NO     1641.0  2480.0  0.6018  (2480.0/4121.0)
1  YES    595.0   4011.0  0.1292  (595.0/4606.0)
2  Total  2236.0  6491.0  0.3524  (3075.0/8727.0)

...

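However, my own dataset does not behave the same way, even though it appears to have the same form. Its target variable also has a binary label. Some information about my dataset: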

y_test.nunique()
failure    2
dtype: int64

However, my performance (perf) metrics are only a small subset of those in the example:

perf = gbm.model_performance(hf_test)
perf
ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.02363221438767555
RMSE: 0.1537277281028883
MAE: 0.07460874699751764
RMSLE: 0.12362377397478382
Mean Residual Deviance: 0.02363221438767555

Because of its sensitive nature, it is difficult to share my data. Any ideas on what to check?

Answer 1

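You are training a regression model (note the ModelMetricsRegression header in your output), which is why the binary classification metrics are missing. H2O decides whether to train a regression or a classification model by looking at the data type of the response column.

We explain this in the H2O User Guide, but it is a question we get a lot because it differs from how scikit-learn works: scikit-learn uses different methods for regression and classification and does not require you to think about column types.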
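To make that rule concrete, here is a minimal sketch of a check you could run before training. It only mirrors the rule as described above and is not H2O's internal logic. The helper name `expected_problem_type` is made up for illustration, and it assumes a type map shaped like the dict returned by `H2OFrame.types` (for example `"int"`, `"real"`, `"enum"`).

```python
def expected_problem_type(column_types, response):
    """Guess which kind of model H2O would train for `response`.

    `column_types` is a dict of column name -> type string, shaped like
    the one returned by `H2OFrame.types`. Hypothetical helper, for
    illustration only.
    """
    col_type = column_types[response]
    if col_type == "enum":
        # categorical response -> classification metrics (F1, AUC, ...)
        return "classification"
    if col_type in ("int", "real"):
        # numeric response -> regression metrics (MSE, MAE, ...)
        return "regression"
    # other types (e.g. "string") are not covered by this sketch
    return "unknown"


# A 0/1 label stored as integers is still a numeric column:
print(expected_problem_type({"failure": "int"}, "failure"))   # regression
print(expected_problem_type({"failure": "enum"}, "failure"))  # classification
```

With a live frame you would pass its type map instead, e.g. `expected_problem_type(hf_test.types, "failure")`, assuming `failure` is the name of your response column.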

y_test.nunique()
failure    2
dtype: int64

To turn the response column of your training data into a categorical column, you can do this:

train["response"] = train["response"].asfactor()
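Putting that together, a rough sketch of the full fix might look like the following. It assumes the response column is named `failure` and that the training and test data are already H2OFrames; the helper names and the `seed` value are placeholders, not something taken from the question. The estimator is imported inside the function only so the snippet can be loaded without a running cluster.

```python
def make_response_categorical(frames, response):
    """Convert `response` to a factor (enum) column in each frame, in place."""
    for frame in frames:
        frame[response] = frame[response].asfactor()


def train_and_score(train, test, predictors, response="failure"):
    """Train a GBM on a categorical response and return test-set metrics.

    Assumes h2o.init() has already been called and that `train` and
    `test` are H2OFrames. Hypothetical helper, for illustration only.
    """
    # imported here so the module loads even without h2o installed
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    # convert both frames so training and scoring agree on the type
    make_response_categorical([train, test], response)

    gbm = H2OGradientBoostingEstimator(seed=1234)
    gbm.train(x=predictors, y=response, training_frame=train)
    return gbm.model_performance(test)
```

If the conversion took effect, the returned object should print as ModelMetricsBinomial rather than ModelMetricsRegression, and `perf.F1()` should then be available on it.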

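Alternatively, when you read the file in from disk, you can parse the response column as the "enum" type so that you don't have to convert it afterwards. There are some examples of how to do that in Python. If the response is stored as integers, H2O simply assumes it is a numeric column when it reads the data from disk. If the response is stored as strings, H2O correctly parses it as a categorical (aka "enum") column, and you don't need to specify or convert anything.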
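As a sketch of what forcing the type at load time could look like: `h2o.import_file` accepts a `col_types` argument, and the `H2OFrame` constructor accepts `column_types` when building a frame from in-memory data such as a pandas DataFrame. The column name `failure` and the function names below are assumptions made for illustration.

```python
RESPONSE = "failure"                 # assumed name of the response column
RESPONSE_TYPES = {RESPONSE: "enum"}  # force a categorical parse


def load_from_disk(path):
    """Read a file, parsing the response as enum. Needs h2o.init() first."""
    import h2o  # imported here so the module loads without h2o installed
    return h2o.import_file(path, col_types=RESPONSE_TYPES)


def load_from_pandas(df):
    """Build an H2OFrame from a pandas DataFrame with an enum response."""
    import h2o
    return h2o.H2OFrame(df, column_types=RESPONSE_TYPES)
```

Either way, checking `frame.types` afterwards is a quick way to confirm that the response came through as "enum".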
