
Truncated Poisson vs Hurdle model: why different values?

I want to fit a count data regression in which the number of doctor visits is the dependent variable. I first fitted a two-part model: a probit model for no doctor visit at all versus one or more, followed by a Poisson model for at least one doctor visit. As a robustness check I then fitted a hurdle model because, as far as I know, both approaches should give very similar values. The probit parts do agree almost exactly. The Poisson parts, however, give very different values. Does anyone have any idea why? Here are the commands I used:

probit_doc <- glm(docbin ~ phi + gender +  age + health + educ + smoke + logthinc + 
                    wave + AUS + GER + SWE + NED + ESP + ITA + FRA + DEN + GRE + 
                    SWI + BEL + ISR + CZE + POL + LUX + HUN + POR + SVN + EST + 
                    CRO + LIT + BUL + CYP + FIN + LVA + MAL + ROM, 
                  data=allwaves, family=binomial(link="probit"))
  
poisson_doc <- glm(I(doc > 0) ~ phi + gender + age + health + educ + smoke + 
                     logthinc + wave + AUS + GER + SWE + NED + ESP + ITA + FRA +
                     DEN + GRE + SWI + BEL + ISR + CZE + POL + LUX + HUN + POR +
                     SVN + EST + CRO + LIT + BUL + CYP + FIN + LVA + MAL + ROM,
                   data=allwaves, family="poisson")
  
hd_doc <- hurdle(doc ~ phi + gender + age + educ + smoke + logthinc + wave + 
                   AUS + GER + SWE + NED + ESP + ITA + FRA + DEN + GRE + SWI +
                   BEL + ISR + CZE + POL + LUX + HUN + POR + SVN + EST + CRO +
                   LIT + BUL + CYP + FIN + LVA + MAL + ROM | phi + gender +  
                   age + health + educ + smoke + logthinc + wave + AUS + GER + 
                   SWE + NED + ESP + ITA + FRA + DEN + GRE + SWI + BEL + ISR + 
                   CZE + POL + LUX + HUN + POR + SVN + EST + CRO + LIT + BUL + 
                   CYP + FIN + LVA + MAL + ROM,
                 dist="poisson", data=allwaves, zero.dist="binomial", link="probit")

In a hurdle model we usually fit a zero part, using a binomial model for the "hurdle", and a count part for what happens once the hurdle has been passed. Accordingly, in the zero part we binarize the outcome according to whether it is non-zero, as you did correctly.

In the count part, however, only the observations that have passed the hurdle are taken into account. So rather than manipulating the outcome variable as we did in the zero model, we want to subset the data to the observations whose outcome is non-zero.

Consider this example from the pscl package, which you are evidently using.

library(pscl)
hurdle(art ~ ., dist="poisson", zero.dist="binomial", link="probit", data=bioChemists)$coe
# $count
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.67113931 -0.22858266  0.09648499 -0.14218756 -0.01272637  0.01874548 
# 
# $zero
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.15420081 -0.14616546  0.19834717 -0.17380191  0.01864404  0.04433777 

We can replicate the zero part by binarizing the outcome variable as I(art > 0).

## zero model
glm(I(art > 0) ~ ., family=binomial(link='probit'), bioChemists)$coe
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.15420081 -0.14616546  0.19834717 -0.17380191  0.01864405  0.04433781

For the count part, if we also binarize the outcome variable as I(art > 0), we get wrong results.

## count model misspecified
glm(I(art > 0) ~ ., family=poisson(), bioChemists)$coe  
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
# -0.52364397 -0.07020092  0.08944930 -0.08814266  0.01987790  0.01258345

Instead, we want to subset the data to the observations where the outcome is non-zero. The coefficients then come closer to those of hurdle, although they still do not match.

## count model correctly specified
glm(art ~ ., family=poisson(), subset(bioChemists, art > 0))$coe
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.82722135 -0.16220935  0.06824070 -0.09902318 -0.01311912  0.01504273 

However, as your title already suggests, hurdle fits a truncated Poisson model for the count part, also known as the positive Poisson distribution, whereas you used a conventional Poisson in glm.
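To make the difference concrete, here is the standard zero-truncated (positive) Poisson written out, assuming a log link with linear predictor x_i'β. The symbols λ, β and x_i are notation introduced here for illustration and do not appear in the question.

```latex
\begin{align*}
P(Y = y \mid Y > 0) &= \frac{e^{-\lambda}\lambda^{y}}{y!\,\bigl(1 - e^{-\lambda}\bigr)}, \qquad y = 1, 2, \dots \\
E[Y \mid Y > 0] &= \frac{\lambda}{1 - e^{-\lambda}} > \lambda \\
\ell(\beta) &= \sum_{i:\, y_i > 0} \Bigl[ y_i \log\lambda_i - \lambda_i - \log(y_i!) - \log\bigl(1 - e^{-\lambda_i}\bigr) \Bigr],
\qquad \log\lambda_i = x_i^{\top}\beta
\end{align*}
```

A conventional Poisson glm on the positive subset maximizes the same sum without the last term, -log(1 - exp(-λ_i)), which is why its coefficients differ from the count part of hurdle.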

The VGAM package provides a pospoisson family function, which gives exactly what we want.

VGAM::vglm(art ~ ., family=VGAM::pospoisson(), subset(bioChemists, art > 0)) |> 
  coefficients()
# (Intercept)    femWomen  marMarried        kid5         phd        ment 
#  0.67113934 -0.22858262  0.09648498 -0.14218724 -0.01272657  0.01874550 
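If installing VGAM is not an option, the same kind of fit can be sketched with base R alone by maximizing the zero-truncated Poisson log-likelihood directly. The following is a minimal illustration on simulated data, not what hurdle or pospoisson do internally; the names beta_true, negll, naive and ztp, as well as the simulated covariate x, are made up for this example.

```r
set.seed(1)

# Simulate Poisson counts with a log link (illustrative coefficients)
n <- 5000
x <- rnorm(n)
beta_true <- c(0.2, 0.5)
y <- rpois(n, exp(beta_true[1] + beta_true[2] * x))

# Keep only the observations that passed the hurdle
pos <- y > 0
X  <- cbind(1, x[pos])
yp <- y[pos]

# Mean negative log-likelihood of the zero-truncated Poisson:
# the usual Poisson log-density minus log(1 - exp(-lambda))
negll <- function(beta) {
  lambda <- exp(drop(X %*% beta))
  -mean(dpois(yp, lambda, log = TRUE) - log1p(-exp(-lambda)))
}

# Conventional Poisson on the positive subset, used as starting values
naive <- glm(yp ~ x[pos], family = poisson())

# Zero-truncated Poisson fit by direct maximum likelihood
fit <- optim(coef(naive), negll, method = "BFGS")
ztp <- fit$par

rbind(truth = beta_true, naive = coef(naive), truncated = ztp)
```

In this simulation the truncated fit should land near the coefficients used to generate the data, while the conventional Poisson on the positive subset overstates the intercept, because the mean of the positive counts is larger than λ.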

A visual comparison of the two distributions shows that the positive Poisson is indeed better suited to the count part of a hurdle model.

[Figure: comparison of the Poisson and positive Poisson distributions]
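A comparison of this kind can be reproduced with base R. The sketch below is not the original figure; the rate lambda = 1.5 and the range 0 to 10 are arbitrary choices for illustration. The positive Poisson puts no mass on zero and rescales the remaining Poisson probabilities by 1 / (1 - P(Y = 0)).

```r
lambda <- 1.5   # arbitrary rate, chosen only for illustration
y <- 0:10

# Conventional Poisson probabilities
p_pois <- dpois(y, lambda)

# Positive (zero-truncated) Poisson: zero is impossible,
# the remaining probabilities are rescaled to sum to one
p_pos <- ifelse(y > 0, dpois(y, lambda) / (1 - dpois(0, lambda)), 0)

# Side-by-side bars for the two distributions
barplot(rbind(p_pois, p_pos), beside = TRUE, names.arg = y,
        legend.text = c("Poisson", "positive Poisson"),
        xlab = "y", ylab = "probability")
```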
