
R Partial Dependence Plots for XGBoost

I've fitted an XGBoost model on a sparse matrix and am trying to display some partial dependence plots. I've been using the pdp package but am open to suggestions. The code below is a reproducible example of what I'm trying to do.

# load required packages
require(Matrix)  # package names are case-sensitive: "Matrix", not "matrix"
require(xgboost)
require(pdp)

# dummy data
categorical <- c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B')
numerical <- c(1, 2, 3, 4, 1, 2, 3, 4)
target <- c(100, 200, 300, 400, 500, 600, 700, 800)
data <- data.frame(categorical, numerical, target)

# create sparse matrix and run xgb
data.sparse = sparse.model.matrix(target~.-1,data)
data.xgb <- xgboost(data=data.sparse, label=data$target, nrounds=100)

# attempt to create partial dependence plots
partial(data.xgb, pred.var="numerical", plot=TRUE, rug=TRUE, train=data, type="regression")
partial(data.xgb, pred.var="categorical", plot=TRUE, rug=TRUE, train=data, type="regression")
partial(data.xgb, pred.var="categoricalA", plot=TRUE, rug=TRUE, train=data.sparse, type="regression")
partial(data.xgb, pred.var="categoricalB", plot=TRUE, rug=TRUE, train=data.sparse, type="regression")

# confirm the model is making sensible predictions despite pdp looking odd
chk <- data[2,]
chk.sparse = sparse.model.matrix(target~.-1,chk)
chk.pred <- predict(data.xgb, chk.sparse)
print(chk.pred) # gives expected values e.g. 199.9992 for second row

Questions

  1. How can I display a PDP for the categorical variable so that I see A and B on one chart, rather than having a separate line for categoricalA?
  2. Why, in this example, does the model predict correct values while the PDP for the numerical variable is flat?
  3. I'd love for someone to post some code demonstrating how cross-validation and/or grid search could be implemented in the example above (assuming the data were bigger).

Many thanks

It appears you will have to get the data out of partial by setting plot to FALSE and then create your own plot. I recommend geom_crossbar for categorical variables. I looked into the code for the partial function in pdp on GitHub: there is a cats argument where you are supposed to name the categorical variables, but as far as I can see it is not used anywhere in the function. For cross-validation and grid search, use caret. This is a great resource to learn how.
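
As a rough illustration of the "build your own plot" idea, the sketch below computes the partial dependence by hand instead of calling partial(), so it does not depend on how pdp treats sparse matrices. It makes several assumptions: categorical is stored as a factor so that both dummy columns survive when the matrix is rebuilt, the model is fitted with xgb.train on an xgb.DMatrix rather than with the xgboost() call from the question, and make_sparse and manual_pd are hypothetical helper names introduced only for this example.

```r
library(Matrix)
library(xgboost)

# Same dummy data as in the question; categorical is an explicit factor so
# every level keeps its own column when the design matrix is rebuilt.
data <- data.frame(
  categorical = factor(c("A", "A", "A", "A", "B", "B", "B", "B")),
  numerical   = c(1, 2, 3, 4, 1, 2, 3, 4),
  target      = c(100, 200, 300, 400, 500, 600, 700, 800)
)

# Hypothetical helper: sparse design matrix built from the predictors only.
make_sparse <- function(df) {
  sparse.model.matrix(~ . - 1, data = df[, c("categorical", "numerical")])
}

dtrain <- xgb.DMatrix(make_sparse(data), label = data$target)
fit <- xgb.train(params = list(objective = "reg:squarederror"),
                 data = dtrain, nrounds = 100)

# Hypothetical helper: for each candidate value, overwrite the variable in
# every row, rebuild the matrix, predict, and summarise the predictions.
manual_pd <- function(model, df, var, values) {
  rows <- lapply(values, function(v) {
    tmp <- df
    if (is.factor(df[[var]])) {
      tmp[[var]] <- factor(rep(v, nrow(df)), levels = levels(df[[var]]))
    } else {
      tmp[[var]] <- rep(v, nrow(df))
    }
    p <- predict(model, xgb.DMatrix(make_sparse(tmp)))
    data.frame(value = v, yhat = mean(p), ymin = min(p), ymax = max(p))
  })
  do.call(rbind, rows)
}

pd_cat <- manual_pd(fit, data, "categorical", levels(data$categorical))
pd_num <- manual_pd(fit, data, "numerical", sort(unique(data$numerical)))
print(pd_cat)
print(pd_num)

# Plot A and B on one chart; skipped if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pd_cat, aes(x = value, y = yhat, ymin = ymin, ymax = ymax)) +
    geom_crossbar(width = 0.4) +
    labs(x = "categorical", y = "predicted target")
  print(p)
}
```

Here yhat is the partial dependence (the mean prediction with the variable held at that value), while ymin and ymax show the spread of the individual predictions, which is what the crossbar box displays. Because the whole factor is overwritten before the matrix is rebuilt, the categoricalA and categoricalB columns always change together, so A and B end up on a single chart.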

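For the cross-validation and grid search part, here is a minimal sketch that uses xgboost's own xgb.cv in a loop over expand.grid, rather than caret as suggested above. The data are a synthetic, larger stand-in for the question's example (an assumption, since eight rows are too few to split into folds), and the grid values, fold count and number of rounds are arbitrary choices for illustration.

```r
library(Matrix)
library(xgboost)

set.seed(42)

# Synthetic stand-in for a "bigger" version of the question's data (assumed).
n <- 200
data <- data.frame(
  categorical = factor(sample(c("A", "B"), n, replace = TRUE)),
  numerical   = sample(1:4, n, replace = TRUE)
)
data$target <- 100 * data$numerical +
  400 * (data$categorical == "B") + rnorm(n, sd = 10)

X <- sparse.model.matrix(~ categorical + numerical - 1, data = data)
dtrain <- xgb.DMatrix(X, label = data$target)

# Candidate hyperparameters; one row per combination.
grid <- expand.grid(max_depth = c(2, 4), eta = c(0.1, 0.3))
grid$best_rmse <- NA_real_
grid$best_nrounds <- NA_integer_

for (i in seq_len(nrow(grid))) {
  cv <- xgb.cv(
    params = list(objective = "reg:squarederror",
                  max_depth = grid$max_depth[i],
                  eta = grid$eta[i]),
    data = dtrain, nrounds = 50, nfold = 5, verbose = FALSE
  )
  log <- as.data.frame(cv$evaluation_log)
  best <- which.min(log$test_rmse_mean)   # round with the lowest held-out RMSE
  grid$best_rmse[i] <- log$test_rmse_mean[best]
  grid$best_nrounds[i] <- best
}

# Combinations ordered from best to worst cross-validated RMSE.
print(grid[order(grid$best_rmse), ])
```

The top row of the printed table gives the combination with the lowest cross-validated RMSE; a final model could then be refitted on all the data with xgb.train using that row's max_depth, eta and best_nrounds.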

 