Using mob() trees (partykit package) with logistic() model
I am trying to use model-based recursive partitioning (MOB) with the mob() function (from the partykit package) to obtain the parameters associated with each feature, depending on the optimal partition found, using a logistic regression (glm-binomial). To do so, I had to define my own model-fitting function.
Following the example on page 7 of https://cran.r-project.org/web/packages/partykit/vignettes/mob.pdf, I created a logit function that fits the model and returns the estimates etc. of the logistic regression. However, my definition of the function does not seem to be correct.
library(partykit)
logit_func <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  glm(y ~ 0 + x, family = binomial, start = start, ...)
}
p <- mob(future ~ ., data = sample, fit = logit_func)
... and I get the following error:
Error in model.frame.default(formula = y ~ 0 + x, drop.unused.levels = TRUE) :
invalid type (NULL) for variable 'x'
The sample data frame is the following:
sample <- structure(list(future = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("0", "1"), class = "factor"), HHk = c(0.412585987717856,
1, 1, 1, 1, 1, 1, 1, 0.865684350743137, 0.685221125225357), HHd = c(0.529970735028671,
1, 1, 1, 0.611295754192343, 0.171910197073699, 0.722887386610618,
0.457585763978574, 0.517888089662373, 0.401285262785306), via_4 = structure(c(1L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
region_5 = structure(c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), .Label = c("0", "1"), class = "factor")), row.names = c(NA,
10L), class = "data.frame")
Any clue?
Thank you :)
Apparently, the problem is related to the formula argument of partykit::mob. I don't know which model you have in mind, but you did not specify any partitioning variable (Z). The following works, but it does not find any breaks. I assume that this is because of how small the data set is.
The call below assumes a model in which HHk is your regressor and HHd is used as the partitioning variable.
p <- mob(formula = future ~ HHk | HHd,
         data = sample,
         fit = logit_func)
# Model-based recursive partitioning (logit_func)
#
# Model formula:
# future ~ HHk | HHd
#
# Fitted party:
# [1] root: n = 10
# x(Intercept) xHHk
# -1.386266 2.006611
#
# Number of inner nodes: 0
# Number of terminal nodes: 1
# Number of parameters per node: 2
# Objective function: 6.557608
In your mob() call, the formula only has a single right-hand side of type y ~ z, as opposed to a two-part right-hand side of type y ~ x | z. The z variables are the ones used for splitting/partitioning in the tree and the x variables are the ones used as regressors in the model. (As already pointed out in the response by Álvaro.)
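One quick way to see the difference between the two formula types is to count their right-hand-side parts with the Formula package, which partykit imports. This is only an illustrative sketch: the object names f_one and f_two are made up here, and the check is not claimed to be exactly what mob() does internally.

```r
library(Formula)

# One-part right-hand side: everything after ~ forms a single block
f_one <- as.Formula(future ~ .)

# Two-part right-hand side: regressors (x) before |,
# partitioning variables (z) after |
f_two <- as.Formula(future ~ HHk | HHd)

# length() on a Formula object returns the number of parts
# on the left-hand and right-hand side, in that order
length(f_one)
length(f_two)
```

The second element of each result is the number of right-hand-side parts: one for the formula from the question, two for the formula with explicit regressors and partitioning variables.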
In principle, it is fine not to have any regressors: you can simply use a constant fit (i.e., an intercept-only model). However, the logit_func() you defined does not catch this case. There are three ways to remedy this:
1. Catch the case if(is.null(x)) inside logit_func() and then use glm(y ~ 1, ...).
2. Keep logit_func() as it is, and specify the regression on the intercept explicitly: mob(future ~ 1 | ., data = sample, fit = logit_func).
3. Use the dedicated glmtree() function rather than the general mob() plus the hand-crafted logit_func(): glmtree(future ~ ., data = sample, family = binomial).
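The first two strategies could be sketched roughly as follows. The name logit_func2 is hypothetical (it is just the question's logit_func() with an added is.null(x) branch), and the data frame is re-created by hand from the dput() output in the question.

```r
library(partykit)

# Sample data, re-typed from the dput() output in the question
sample <- data.frame(
  future = factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)),
  HHk = c(0.412585987717856, 1, 1, 1, 1, 1, 1, 1,
          0.865684350743137, 0.685221125225357),
  HHd = c(0.529970735028671, 1, 1, 1, 0.611295754192343,
          0.171910197073699, 0.722887386610618, 0.457585763978574,
          0.517888089662373, 0.401285262785306),
  via_4 = factor(c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0)),
  region_5 = factor(c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1))
)

# Original fitting function from the question
logit_func <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  glm(y ~ 0 + x, family = binomial, start = start, ...)
}

# Strategy 1 (hypothetical variant): fall back to an intercept-only
# model when mob() passes no regressor matrix, i.e., x is NULL
logit_func2 <- function(y, x, start = NULL, weights = NULL, offset = NULL, ...) {
  if (is.null(x)) {
    glm(y ~ 1, family = binomial, start = start, ...)
  } else {
    glm(y ~ 0 + x, family = binomial, start = start, ...)
  }
}
p1 <- mob(future ~ ., data = sample, fit = logit_func2)

# Strategy 2: keep logit_func() and request the intercept explicitly
p2 <- mob(future ~ 1 | ., data = sample, fit = logit_func)

coef(p1)
coef(p2)
```

Both calls fit a single intercept in the root node, so the coefficient labels may differ between the two trees while the estimate itself should agree.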
All three will lead to the same tree, but Strategy 3 is strongly preferred for a number of reasons: (a) It is readily available and does not require creating custom code. (b) The fitting function used internally is computationally more efficient (e.g., it avoids repetitive formula parsing etc.). (c) There are better methods available for the resulting tree, e.g., a nicer plot() and more options in the predict() method.
Additionally, it might make sense to consider some of the explanatory variables as regressors and others as splitting variables (as suggested by Álvaro). But this depends on the data and the application case, and it's hard to make recommendations without further context.
The results on your sample data are shown below. Of course, on this small data set no splits are found, but on the full data set it should hopefully work as expected.
p <- glmtree(future ~ ., data = sample, family = binomial)
p
## Generalized linear model tree (family: binomial)
##
## Model formula:
## future ~ 1 | .
##
## Fitted party:
## [1] root: n = 10
## (Intercept)
## 0.4054651
##
## Number of inner nodes: 0
## Number of terminal nodes: 1
## Number of parameters per node: 1
## Objective function (negative log-likelihood): 6.730117
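If you do want some variables as regressors and others as splitting variables, the same two-part formula also works with glmtree(). The sketch below assumes, purely for illustration, that HHk is the regressor and the remaining variables are partitioning variables; that assignment is not a recommendation for your application. The data frame is again re-typed from the dput() output in the question, and the names p3 and g are made up here.

```r
library(partykit)

# Sample data, re-typed from the dput() output in the question
sample <- data.frame(
  future = factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)),
  HHk = c(0.412585987717856, 1, 1, 1, 1, 1, 1, 1,
          0.865684350743137, 0.685221125225357),
  HHd = c(0.529970735028671, 1, 1, 1, 0.611295754192343,
          0.171910197073699, 0.722887386610618, 0.457585763978574,
          0.517888089662373, 0.401285262785306),
  via_4 = factor(c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0)),
  region_5 = factor(c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1))
)

# Illustrative assumption: HHk as regressor (x), the rest as
# partitioning variables (z)
p3 <- glmtree(future ~ HHk | HHd + via_4 + region_5,
              data = sample, family = binomial)

# Plain logistic regression on the full sample, for comparison
g <- glm(future ~ HHk, data = sample, family = binomial)

coef(p3)
coef(g)
```

With only 10 observations no split is made, so the coefficients in the root node should coincide with those of the plain glm() fit.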