
Difference between scikit-learn and caret GBM results?

I'm getting drastically different F1 scores from scikit-learn and caret with the same input data. Here's how I'm running a GBM model in each.

scikit-learn (F1 is the scoring output):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

est = GradientBoostingClassifier(n_estimators = 4000, learning_rate = 0.1, max_depth = 5, max_features = 'log2', random_state = 0)
cv = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 0)  # labels go to cross_val_score, not the splitter, in the modern API
scores = cross_val_score(est, data, labels, scoring = 'f1', cv = cv, n_jobs = -1)  # cv must be passed by keyword here

caret (F1 must be defined and supplied as the summary function):

f1 <- function(data, lev = NULL, model = NULL) {
  # F1_Score() comes from the MLmetrics package
  f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  c(F1 = f1_val)
}
set.seed(0)
gbm <- train(label ~ ., 
           data = data, 
           method = "gbm",
           trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3, 
                                    summaryFunction = f1, classProbs = TRUE),
           metric = "F1",
           verbose = FALSE)

From the above code, I get an F1 score of ~0.8 using scikit-learn and ~0.25 using caret. A small difference might be attributed to algorithmic differences, but I must be doing something wrong in the caret modeling to get the massive difference I'm seeing here. I'd prefer not to post my data set, so hopefully the issue can be diagnosed from the code alone. Any help would be much appreciated.

GBT is an ensemble of decision trees. The difference comes from:

  • The number of decision trees in the ensemble (n_estimators = 4000 vs. n.trees = 100).
  • The shape (breadth, depth) of the individual decision trees (max_depth = 5 vs. interaction.depth = 1); see the sketch after this list for aligning these settings in caret.
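For an apples-to-apples comparison, caret can be forced to fit a single gbm model whose settings mirror the scikit-learn call. The following is a hedged sketch (not part of the original answer) that reuses the question's data and f1 objects: n_estimators maps to n.trees, max_depth to interaction.depth, and learning_rate to shrinkage, while n.minobsinnode is left at gbm's default of 10 because scikit-learn's leaf-size control has no exact counterpart here.

grid <- expand.grid(n.trees = 4000,
                    interaction.depth = 5,
                    shrinkage = 0.1,
                    n.minobsinnode = 10)
set.seed(0)
gbm_matched <- train(label ~ .,
                     data = data,
                     method = "gbm",
                     trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3,
                                              summaryFunction = f1, classProbs = TRUE),
                     tuneGrid = grid,  # a single combination: no grid search, just the matched fit
                     metric = "F1",
                     verbose = FALSE)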

Currently, you're comparing the F1 score of a 100 MB GradientBoostingClassifier object with a 100 kB gbm object - one GBT model contains literally thousands of times more information than the other.
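To see how small the caret-selected model actually is, you can inspect the fitted train object directly. A minimal sketch, assuming gbm is the train object from the question:

gbm$bestTune                 # the n.trees / interaction.depth / shrinkage / n.minobsinnode combination caret picked
object.size(gbm$finalModel)  # approximate in-memory size of the fitted gbm model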

You may wish to export both models to the standardized PMML representation using the sklearn2pmml and r2pmml packages, and look inside the resulting PMML files (plain text, so they can be opened in any text editor) to better grasp their internal structure.
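On the R side, a minimal export sketch, assuming the r2pmml package (and the Java runtime it depends on) is installed; the exact call for a caret train object may vary by package version:

library(r2pmml)
r2pmml(gbm, "gbm.pmml")  # writes a plain-text PMML file that can be opened in any editor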
