
Zero-inflated overdispersed count data glmmTMB error in R

I am working with count data (available here) that are zero-inflated and overdispersed and have random effects. The package best suited to this sort of data is glmmTMB (details here and troubleshooting here).

Before working with the data, I inspected them for normality (they are zero-inflated), homogeneity of variance, correlations, and outliers. The data had two outliers, which I removed from the dataset linked above. There are 351 observations from 18 locations (prop_id).

The data look like this:

euc0 ea_grass ep_grass np_grass np_other_grass month year precip season   prop_id quad
 3      5.7      0.0     16.7            4.0     7 2006    526 Winter    Barlow    1
 0      6.7      0.0     28.3            0.0     7 2006    525 Winter    Barlow    2
 0      2.3      0.0      3.3            0.0     7 2006    524 Winter    Barlow    3
 0      1.7      0.0     13.3            0.0     7 2006    845 Winter    Blaber    4
 0      5.7      0.0     45.0            0.0     7 2006    817 Winter    Blaber    5
 0     11.7      1.7     46.7            0.0     7 2006    607 Winter    DClark    3

The response variable is euc0 and the random effects are prop_id and quad. The remaining variables are fixed effects (all representing the percent cover of different plant species).

The model I want to run:

library(glmmTMB)
seed0<-glmmTMB(euc0 ~ ea_grass + ep_grass + np_grass + np_other_grass + month + year*precip + season*precip + (1|prop_id)  + (1|quad), data = euc, family=poisson(link=identity))

fit_zinbinom <- update(seed0, family=nbinom2) # nbinom2: variance increases quadratically with the mean

The error I get after running the seed0 code is:

Error in optimHess(par.fixed, obj$fn, obj$gr): gradient in optim evaluated to length 1 not 15
In addition: There were 50 or more warnings (use warnings() to see the first 50)

warnings() gives:

1. In (function (start, objective, gradient = NULL, hessian = NULL,  ... :
  NA/NaN function evaluation

I also normally mean-center and standardize my numerical variables, but this only removes the first error; the NA/NaN error remains. I tried adding a glmmTMBControl statement like this OP, but it just opened a whole new world of errors.

How can I fix this? What am I doing wrong?

A detailed explanation would be appreciated so that I can learn how to troubleshoot this better myself in the future. Alternatively, I am open to an MCMCglmm solution, as that function can also deal with this sort of data (despite taking longer to run).

An incomplete answer...

  • identity-link models for limited-domain response distributions (e.g. Gamma or Poisson, where negative values are impossible) are computationally problematic; in my opinion they're often conceptually problematic as well, although there are some reasonable arguments in their favor. Do you have a good reason to do this?
  • This is a pretty small data set for the model you're trying to fit: 13 fixed-effect predictors and 2 random-effect predictors. The rule of thumb is that you want about 10-20 times as many observations as predictors. That seems to fit OK with your 345 or so observations, but only 40 of your observations are non-zero! That means your 'effective' number of observations (amount of information) will be much smaller (see Frank Harrell's Regression Modeling Strategies for more discussion of this point).
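To make the first bullet concrete, here is a minimal base-R sketch (toy numbers and made-up coefficients, not the euc data) of how an identity link can push a Poisson mean below zero. Once that happens the log-likelihood is NaN, which is plausibly what lies behind the "NA/NaN function evaluation" warnings in the question.

```r
# Toy predictor, response, and coefficients (all hypothetical)
x <- c(0, 1, 2, 3)
y <- c(1, 0, 0, 0)
beta0 <- 0.5
beta1 <- -0.4

# Identity link: the mean is the linear predictor itself and can go negative
mu_identity <- beta0 + beta1 * x        # 0.5, 0.1, -0.3, -0.7

# Log link: the mean is exp(linear predictor), always positive
mu_log <- exp(beta0 + beta1 * x)

# dpois() returns NaN (with a warning) for a negative mean
ll_identity <- suppressWarnings(sum(dpois(y, lambda = mu_identity, log = TRUE)))
ll_log      <- sum(dpois(y, lambda = mu_log, log = TRUE))

print(ll_identity)  # NaN
print(ll_log)       # a finite number
```

An optimizer that wanders into such a region of parameter space has no usable objective value or gradient to work with.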

That said, let me run through some of the things I tried and where I ended up.

  • GGally::ggpairs(euc, columns=2:10) doesn't detect anything obviously terrible about the data (I did throw out the data point with euc0==78)

In order to try to make the identity-link model work I added some code to glmmTMB. You should be able to install it via remotes::install_github("glmmTMB/glmmTMB/glmmTMB@clamp") (note that you will need compilers etc. installed to do this). This version takes negative predicted values and forces them to be non-negative, while adding a corresponding penalty to the negative log-likelihood.

Using the new version of glmmTMB I don't get an error, but I do get these warnings:
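The clamp-plus-penalty idea can be sketched in a few lines of base R. This is a hypothetical illustration only, not the code in the glmmTMB clamp branch: the function name nll_clamped, the floor eps, the quadratic form of the penalty, and penalty_weight are all assumptions made for the example.

```r
# Hypothetical sketch: Poisson negative log-likelihood with an identity link,
# where negative means are clamped and penalized. NOT glmmTMB's actual code.
nll_clamped <- function(beta, x, y, eps = 1e-6, penalty_weight = 1e3) {
  mu_raw  <- beta[1] + beta[2] * x                 # identity link
  mu      <- pmax(mu_raw, eps)                     # force a small positive floor
  penalty <- penalty_weight * sum(pmax(eps - mu_raw, 0)^2)  # cost of clamping
  -sum(dpois(y, lambda = mu, log = TRUE)) + penalty
}

x <- c(0, 1, 2, 3, 4)
y <- c(4, 2, 1, 0, 0)

# These coefficients give raw means 1, 0.5, 0, -0.5, -1; the result is still finite
val <- nll_clamped(c(1, -0.5), x = x, y = y)

# The optimizer can now move through the formerly invalid region
fit <- optim(c(1, 0), nll_clamped, x = x, y = y)
print(val)
print(fit$par)
```

The penalty keeps the objective finite everywhere while still discouraging parameter values that imply negative means.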

Warning messages:
1: In fitTMB(TMBStruc): Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
2: In fitTMB(TMBStruc): Model convergence problem; false convergence (8). See vignette('troubleshooting')

The Hessian (second-derivative) matrix being non-positive-definite means there are some (still hard-to-troubleshoot) problems. heatmap(vcov(f2)$cond,Rowv=NA,Colv=NA) lets me look at the covariance matrix. (I also like corrplot::corrplot.mixed(cov2cor(vcov(f2)$cond),"ellipse","number"), but that doesn't work when vcov(.)$cond is non-positive definite. In a pinch you can use sfsmisc::posdefify() to force it to be positive definite...)

Tried scaling:
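As a minimal base-R sketch of what "non-positive-definite" means in practice: a symmetric matrix is positive definite exactly when all its eigenvalues are positive. The matrix H below is a toy example, and the eigenvalue-clipping step only illustrates the general idea of forcing positive definiteness; it is not claimed to be the algorithm sfsmisc::posdefify() uses.

```r
# Toy symmetric matrix with one negative eigenvalue (determinant is -7)
H <- matrix(c(2, 3,
              3, 1), nrow = 2, byrow = TRUE)

ev <- eigen(H, symmetric = TRUE)
is_pd <- all(ev$values > 0)
print(ev$values)
print(is_pd)        # FALSE

# Crude repair: clip eigenvalues at a small positive floor and rebuild
eps <- 1e-7
H_fixed <- ev$vectors %*% diag(pmax(ev$values, eps)) %*% t(ev$vectors)
is_pd_fixed <- all(eigen(H_fixed, symmetric = TRUE, only.values = TRUE)$values > 0)
print(is_pd_fixed)  # TRUE
```

A repaired matrix is fine for plotting correlations, but it does not fix the underlying identifiability problem in the model.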

eucsc <- dplyr::mutate_at(euc1,dplyr::vars(c(ea_grass:precip)), ~c(scale(.)))

This will help some - before scaling, we were still doing a few silly things like treating year as a numeric variable without centering it (so the 'intercept' of the model is at year 0 of the Gregorian calendar...)

But that still doesn't fix the problem.
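The point about the intercept can be seen in a tiny example. This is a sketch using an ordinary linear model on made-up numbers (not the euc data and not a glmmTMB fit), chosen so the arithmetic is exact.

```r
# Made-up data: the response rises by 2 per year, starting at 1 in 2006
year <- 2006:2010
y <- 1 + 2 * (year - 2006)               # 1, 3, 5, 7, 9

fit_raw      <- lm(y ~ year)             # intercept = prediction at year 0
fit_centered <- lm(y ~ I(year - 2006))   # intercept = prediction in 2006

print(coef(fit_raw)[["(Intercept)"]])       # about -4011, a wild extrapolation
print(coef(fit_centered)[["(Intercept)"]])  # about 1

# c(scale(.)) centers to mean 0 and scales to SD 1, dropping the attributes
year_sc <- c(scale(year))
print(year_sc)
```

Both fits describe the same line; centering only moves the intercept to a value that lies inside the data, which also tends to help optimizers.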

Looking more closely at the ggpairs plot, it looks like season and year are confounded: with(eucsc,table(season,year)) shows that observations occur in Spring and Winter in one year and Autumn in the other year. season and month are also confounded: if we know the month, then we automatically know the season.
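A minimal sketch of this kind of cross-tabulation check, on a toy data frame that mimics the pattern described (the rows are invented for illustration; they are not the euc data):

```r
# Invented rows: Winter and Spring sampled in 2006, Autumn in 2007
toy <- data.frame(
  year   = c(2006, 2006, 2006, 2006, 2007, 2007),
  month  = c(7, 7, 10, 10, 4, 4),
  season = c("Winter", "Winter", "Spring", "Spring", "Autumn", "Autumn")
)

# Season by year: empty cells mean the two effects cannot be separated
tab_sy <- with(toy, table(season, year))
print(tab_sy)
years_per_season <- rowSums(tab_sy > 0)

# Month by season: each month falls in exactly one season (nesting)
tab_ms <- with(toy, table(month, season))
seasons_per_month <- rowSums(tab_ms > 0)

print(years_per_season)   # 1 for every season: season determines year
print(seasons_per_month)  # 1 for every month: month determines season
```

When every row of such a table has only one non-empty cell, one variable is a function of the other and the model cannot estimate both effects freely.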

At this point I decided to give up on the identity link and see what happened. update(<previous_model>, family=poisson) (i.e. using a Poisson with a standard log link) worked! So did using family=nbinom2, which was much better.
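For a feel of why the negative binomial can be "much better" than the Poisson on overdispersed counts, here is a sketch with simulated data and plain GLMs (no random effects, so this is not the glmmTMB model from the question). The simulation settings are arbitrary assumptions; MASS::glm.nb() stands in for family=nbinom2.

```r
set.seed(1)
n  <- 500
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)                 # log link: mean is always positive
y  <- rnbinom(n, mu = mu, size = 1)    # variance = mu + mu^2 / size

# Poisson GLM (log link is the default)
fit_pois <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: values well above 1 suggest overdispersion
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
print(dispersion)

# Negative binomial GLM with quadratic mean-variance relationship
fit_nb <- MASS::glm.nb(y ~ x)
print(AIC(fit_pois, fit_nb))
```

With glmmTMB the analogous comparison would use the two fitted models from the update() calls described above.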

I looked at the results and discovered that the CIs for the precip × season coefficients were crazy, so I dropped the interaction term ( update(f2S_noyr_logNB, . ~. - precip:season) ), at which point the results look sensible.

A few final notes:

  • the variance associated with quadrat is effectively zero
  • I don't think you necessarily need zero-inflation; low means and overdispersion (i.e. family=nbinom2) are probably sufficient.
  • the distribution of the residuals looks OK, but there still seems to be some model mis-fit ( library(DHARMa); plot(simulateResiduals(f2S_noyr_logNB2)) ). I would spend some time plotting residuals and predicted values against various combinations of predictors to see if you can localize the problem.
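On the zero-inflation point, a quick base-R calculation shows how many zeros a plain negative binomial already predicts when the mean is low and the dispersion is strong. The values of mu and size below are hypothetical, picked only for illustration; they are not estimates from the euc data.

```r
mu   <- 0.5   # hypothetical low mean
size <- 0.1   # hypothetical NB dispersion parameter (small = very overdispersed)

# Probability of a zero under each distribution with the same mean
p0_pois <- dpois(0, lambda = mu)             # exp(-mu), about 0.61
p0_nb   <- dnbinom(0, mu = mu, size = size)  # (size / (size + mu))^size, about 0.84

print(c(poisson = p0_pois, nbinom2 = p0_nb))
```

So a data set with a large majority of zeros is not, by itself, evidence that a separate zero-inflation component is needed.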

PS A quicker way to see that there's something wrong with the fixed effects (multicollinearity):

X <- model.matrix(~ ea_grass + ep_grass +
                   np_grass + np_other_grass + month +
                   year*precip + season*precip,
                  data=euc)
ncol(X)  ## 13
Matrix::rankMatrix(X) ## 11

lme4 has tests like this, and machinery for automatically dropping aliased columns, but they aren't implemented in glmmTMB at present.
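In the meantime, aliased columns can be located by hand with a pivoted QR decomposition in base R. The sketch below uses an invented toy data frame in which season determines year; it illustrates the idea only and is not lme4's implementation.

```r
# Invented rows: Winter and Spring only in 2006, Autumn only in 2007
toy <- data.frame(
  year   = c(2006, 2006, 2006, 2006, 2007, 2007),
  season = factor(c("Winter", "Winter", "Spring", "Spring", "Autumn", "Autumn"))
)

X   <- model.matrix(~ year + season, data = toy)
qrX <- qr(X)

n_col  <- ncol(X)     # number of fixed-effect columns
rank_X <- qrX$rank    # number of linearly independent columns

# qr() pivots columns it finds linearly dependent to the end
aliased <- colnames(X)[qrX$pivot[seq_len(n_col) > rank_X]]

print(c(columns = n_col, rank = rank_X))
print(aliased)

# Design matrix with the aliased columns removed
X_full_rank <- X[, setdiff(colnames(X), aliased), drop = FALSE]
```

Dropping the listed columns (or, better, removing the corresponding terms from the formula) gives a full-rank design that the optimizer can actually identify.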
