Difference between scikit-learn and caret GBM results?
I'm getting drastically different F1 scores with the same input data in scikit-learn and caret. Here's how I'm running a GBM model in each.
scikit-learn (F1 is the requested scoring metric):
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

est = GradientBoostingClassifier(n_estimators = 4000, learning_rate = 0.1, max_depth = 5, max_features = 'log2', random_state = 0)
cv = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 0)  # labels are passed to split() via cross_val_score
scores = cross_val_score(est, data, labels, scoring = 'f1', cv = cv, n_jobs = -1)
caret (F1 must be defined in a custom summary function and passed as the metric):
library(caret)
library(MLmetrics)  # provides F1_Score()

f1 <- function(data, lev = NULL, model = NULL) {
  f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  c("F1" = f1_val)
}

set.seed(0)
gbm <- train(label ~ .,
             data = data,
             method = "gbm",
             trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3,
                                      summaryFunction = f1, classProbs = TRUE),
             metric = "F1",
             verbose = FALSE)
From the above code, I get an F1 score of ~0.8 using scikit-learn and ~0.25 using caret. A small difference might be attributed to algorithm differences, but I must be doing something wrong in the caret modeling to get the massive difference I'm seeing here. I'd prefer not to post my data set, so hopefully the issue can be diagnosed from the code alone. Any help would be much appreciated.
GBT is an ensemble of decision trees. The difference comes from:

1. The number of decision trees in the ensemble (n_estimators = 4000 vs. n.trees = 100).
2. The shape (width, depth) of individual decision trees (max_depth = 5 vs. interaction.depth = 1).

Currently, you're comparing the F1 score of a 100 MB GradientBoostingClassifier object with a 100 kB gbm object - one GBT model contains literally thousands of times more information than the other.
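A quick way to see how much these two hyperparameters matter is to cross-validate a deliberately small model (matching gbm's defaults of 100 trees at interaction depth 1) against the large configuration from the question on the same data. This is a sketch on a synthetic data set, not the asker's data, so the exact scores will differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the (unshared) data set
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# caret/gbm defaults: 100 depth-1 stumps
small = GradientBoostingClassifier(n_estimators=100, max_depth=1, random_state=0)

# the scikit-learn settings from the question: 4000 depth-5 trees
large = GradientBoostingClassifier(n_estimators=4000, max_depth=5,
                                   max_features='log2', random_state=0)

f1_small = cross_val_score(small, X, y, scoring='f1', cv=3).mean()
f1_large = cross_val_score(large, X, y, scoring='f1', cv=3).mean()
print(f1_small, f1_large)
```

In caret, the equivalent fix is to put the matching values (n.trees = 4000, interaction.depth = 5, shrinkage = 0.1) into a tuneGrid so both libraries fit comparably sized ensembles.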
You may wish to export both models to the standardized PMML representation using the sklearn2pmml and r2pmml packages, and look inside the resulting PMML files (plain text, so they can be opened in any text editor) to better grasp their internal structure.