R code for maximum likelihood estimate from a specific likelihood function

I have been trying to write R code for maximum likelihood estimation from a log-likelihood function in a paper (equation 9 on page 609). The authors estimated it using MATLAB, which I am not familiar with, so I tried to write the code in R.

Here is a snapshot of the log-likelihood function in the paper:

[image: log-likelihood function, equation 9 in the paper]

where

r: Binary decision (0 or 1) indicating whether infested plant(s) were detected (1) or not (0).

e: Inspection efficiency. This is known.

n: Sample size.

The overall objective is to estimate the plant infestation rate (gamma, γ) and epsilon (e) based on a binary decision about the presence or absence of infested plants, rather than on the number of infested plants detected. So the function uses only the binary detection information (r) and the sample size. Since epsilon (e) is known or fixed, the actual goal is to estimate gamma (γ) in a population.

Another objective is to compare the infestation rates estimated above with those from the hypergeometric sampling formula in another paper (page 6). The formula is:

[image: hypergeometric sample size formula]

This formula gives the sample size required to detect infested plants with a selected probability (e.g., 95%) given an infestation rate. For example:

# Sample size calculation function
fosgate.sample1 <- function(box, p, ci){ # Note: box represent total plant number
  ninf <- p*box
  sample.size <- round(((1-(1-ci)^(1/ninf))*(box-(ninf-1)/2)))
  #sample.size <- ceiling(((1-(1-ci)^(1/ninf))*(box-(ninf-1)/2)))
  sample.size
}

fosgate.sample1(box=100, p = .05, ci = .95) # where box: population or total plants, p: infestation rate, and ci: probability of detection
## 44
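
For reference, the expression that fosgate.sample1 evaluates can be written out as below. The symbols are assumptions chosen to match the code: N stands for box, D for ninf (the number of infested plants, p times N), and α for ci. The comments retrace the example call by hand.

```latex
n = \left[ 1 - (1 - \alpha)^{1/D} \right] \left( N - \frac{D - 1}{2} \right),
\qquad D = pN
% Example call: N = 100, p = 0.05, alpha = 0.95, so D = 5
% (1 - alpha)^{1/D} = 0.05^{0.2}, approximately 0.5493
% n is approximately (1 - 0.5493) * (100 - 2) = 0.4507 * 98 = 44.2, which rounds to 44
```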

The idea is that if the sample size (e.g., 44) and binary decision data are provided, the log-likelihood function can be used to estimate the infestation rate, and the estimate may be close to the anticipated rate (e.g., 0.05). Ultimately, I would like to compare the plant infestation rates (gamma, γ) estimated from the log-likelihood function above with D/N in the sample size calculation formula (the second one), or p in the sample size code above.

I wrote R code for the log-likelihood described above.

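### MLE with stats4
library(stats4)
# Negative log-likelihood function.
# Note: insp.result (vector of 0/1 decisions) and n (sample size) are
# taken from the calling environment, not passed as arguments.
plant.inf.lik <- function(inf.rate){
    logl <- suppressWarnings(
        sum((1-insp.result)*n*log(1-inf.rate) +
            insp.result*log(1-(1-inf.rate)^n))
        )
    return(-logl)
}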

Using the sample size function (i.e., fosgate.sample1), I generated sample sizes for various combinations of total plant number (box) and anticipated infestation rate (p). Since I am also interested in the error/confidence ranges of the estimated plant infestation rates, I used bootstrapping to calculate the range of the estimates (I am not sure if this is appropriate/acceptable). Here is the final code I wrote:

### MLE and CI with bootstrapping with multiple scenarios
plant <- c(100, 500, 1000, 5000, 10000, 100000) # Total plant number
ir <- seq(.01, .2, by = .01) # Plant infestation rate
df.result <- data.frame(expand.grid(plant=plant, inf.rate = ir))
df.result$sample.size <- fosgate.sample1(box=df.result$plant, p=df.result$inf.rate, ci=.95) # Sample size
df.result$insp.result <- 1000 # Shipment number (can be replaced with random integers)
df.result <- df.result[order(df.result$plant, df.result$inf.rate, df.result$sample.size), ]
rownames(df.result) <- 1:nrow(df.result)
df.result$est.mean <- 0
#df.result$est.median <- 0
df.result$est.lower.ci <- 0
df.result$est.upper.ci <- 0
df.result$nsim <- 0
str(df.result)
head(df.result)

# Looping
for(j in 1:nrow(df.result)){
    est <- rep(NA, 1000) # Reset the estimates for each scenario
    ir <- df.result$inf.rate[j]
    n <- df.result$sample.size[j]
    for(i in 1:1000){
        insp.result <- sample(c(rep(1, df.result$insp.result[j]-df.result$insp.result[j]*df.result$inf.rate[j]), 
                    rep(0, df.result$insp.result[j]*df.result$inf.rate[j])))
        insp.result <- sample(insp.result, replace = TRUE)
        est[i] <- mle(plant.inf.lik, start = list(inf.rate = ir*.9), method = "BFGS", nobs = length(insp.result))@coef
    }
    # Summarize once per scenario, after all replicates have run
    df.result$est.mean[j] <- mean(est, na.rm = TRUE)
#   df.result$est.median[j] <- median(est, na.rm = TRUE)
    df.result$est.lower.ci[j] <- quantile(est, prob = .025, na.rm = TRUE)
    df.result$est.upper.ci[j] <- quantile(est, prob = .975, na.rm = TRUE)
    df.result$nsim[j] <- length(est)
}

# Significance test result
sig <- ifelse(df.result$inf.rate >= df.result$est.lower.ci & df.result$inf.rate <= df.result$est.upper.ci, "no sig", "sig")
table(sig)

# Plot
library(ggplot2)
library(reshape2)
df.result$num <- ave(df.result$inf.rate, df.result$plant, FUN=seq_along)
df.result.m <- melt(df.result, id.vars=c("plant", "sample.size", "insp.result", "est.lower.ci", "est.upper.ci", "nsim", "num"))
df.result.m$est.lower.ci <- ifelse(df.result.m$variable == "inf.rate", NA, df.result.m$est.lower.ci)
df.result.m$est.upper.ci <- ifelse(df.result.m$variable == "inf.rate", NA, df.result.m$est.upper.ci)
str(df.result.m)

ggplot(data = df.result.m, aes(x = num, y = value, group=variable, color=variable, shape=variable))+
    geom_point()+
    geom_errorbar(aes(ymin = est.lower.ci, ymax = est.upper.ci), width=.5)+
    scale_y_continuous(breaks = seq(0, .2, .02))+
    xlab("Index")+
    ylab("Plant infestation rate")+
    facet_wrap(~plant, ncol = 3)

When I ran the code, I was able to obtain results and compare the estimated (est.mean) and anticipated (inf.rate) infestation rates, as shown in the plot below.

[image: plot of estimated and anticipated infestation rates, faceted by total plant number]

If the results are correct, the plot indicates that the estimation looks fine for low infestation rates but is off for higher ones.

Also, I always got warning messages when I removed the "suppressWarnings" call, and occasionally the error message below. I have no clue how to fix them.

## Warning messages
## 29: In log(1 - (1 - inf.rate)^n) : NaNs produced
## 30: In log(1 - inf.rate) : NaNs produced

## Error message (occasionally)
## Error in solve.default(oout$hessian) : 
## Lapack routine dgesv: system is exactly singular: U[1,1] = 0

My questions are:

  • Is the R function (plant.inf.lik) an appropriate implementation of the log-likelihood function for maximum likelihood estimation?
  • Should I take care of the warning and error messages? If yes, how?
  • Is the bootstrapping (resampling?) method appropriate for estimating CI ranges and/or standard errors?

I found this link useful for an alternative approach. Although I am still working through both approaches, the results seem different (maybe a follow-up question).

Any suggestions would be greatly appreciated.

Concerning your last question about estimating CI ranges, there are three common methods for ML estimators:

  1. Variance estimation from the inverted Hessian matrix.
  2. Jackknife estimator for the variance (simpler and more stable if the Hessian is estimated numerically, but computationally more expensive).
  3. Bootstrap CIs (computationally the most expensive approach).

For bootstrap CIs, you do not need to implement them yourself (bias correction, for example, can be tricky) but can rely on the R package boot.
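
As a rough illustration of the boot route, here is a minimal sketch on simulated data. The data, the seed, the common sample size n = 44, and the helper name mle.rate are all assumptions made for the example. It also assumes every inspection uses the same n, in which case the MLE of the rate has the closed form 1 - (1 - mean(r))^(1/n), so no optimizer is needed inside the resampling loop.

```r
library(boot)

# Simulated inspections (assumption): 1000 binary results, n = 44 plants each,
# true infestation rate 0.05
set.seed(2)
n <- 44
r <- rbinom(1000, size = 1, prob = 1 - (1 - 0.05)^n)

# Statistic in the form boot() expects: data plus a vector of resampled indices.
# For a common n the MLE of the rate is 1 - (1 - mean(r))^(1/n).
mle.rate <- function(data, idx) {
  1 - (1 - mean(data[idx]))^(1 / n)
}

b <- boot(data = r, statistic = mle.rate, R = 2000)

# Percentile interval; columns 4 and 5 of $percent hold the two endpoints
bci <- boot.ci(b, conf = 0.95, type = "perc")
ci <- bci$percent[4:5]
ci
```

boot.ci also accepts type = "bca" for a bias-corrected and accelerated interval. If sample sizes differ between inspections, the closed form no longer applies and the statistic would have to run a numerical optimizer on each resample.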

Incidentally, I wrote a summary with R code for all three approaches two years ago: Construction of Confidence Intervals (see section 5). For the method using the Hessian matrix, for example, the outline is as follows:

lnL <- function(theta) {
  # theta is the parameter vector passed in by optim
  # definition of the negative (!)
  # log-likelihood function...
}

# starting values for the optimization
theta0 <- c(start1, start2, ...)

# optimization
p <- optim(theta0, lnL, hessian=TRUE)
if (p$convergence == 0) {
  theta <- p$par
  covmat <- solve(p$hessian)
  sigma <- sqrt(diag(covmat))
}
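
Applied to the likelihood from the question, the outline might look like the sketch below. The simulated data, the seed, and the names r, negll and fit are assumptions for illustration. Because there is only one parameter, the sketch uses optim with method = "Brent" and explicit bounds. Keeping the rate strictly inside (0, 1) means neither log() call receives a non-positive argument, which is what produces the NaN warnings in an unconstrained search.

```r
# Simulated inspections (assumption): 1000 binary results, n = 44 plants each,
# true infestation rate 0.05
set.seed(1)
n <- 44
r <- rbinom(1000, size = 1, prob = 1 - (1 - 0.05)^n)

# Negative log-likelihood in the infestation rate g
negll <- function(g) {
  -sum((1 - r) * n * log(1 - g) + r * log(1 - (1 - g)^n))
}

# One-dimensional bounded search; the Hessian is computed numerically
# at the optimum
fit <- optim(par = 0.01, fn = negll, method = "Brent",
             lower = 1e-6, upper = 1 - 1e-6, hessian = TRUE)

est <- fit$par
se <- sqrt(1 / fit$hessian[1, 1])        # inverse of the 1x1 Hessian
wald.ci <- est + c(-1, 1) * qnorm(0.975) * se

# Cross-check: with a common n the MLE has a closed form
closed <- 1 - (1 - mean(r))^(1 / n)
c(estimate = est, closed.form = closed, se = se)
wald.ci
```

The closed-form value is only available here because every inspection shares the same n; it serves as a check that the optimizer found the right maximum.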

The function mle from stats4 already wraps the covariance matrix estimation and returns it in vcov. In the practical use cases in which I have tried this (paired comparison models), though, this estimation was rather unstable, and I resorted to the jackknife method instead.
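
A jackknife variance estimate for the infestation rate could be sketched as follows. The simulated data, the seed, and the names mle.rate, theta.loo and jack.se are assumptions for the example, and the estimator inside mle.rate is the closed-form MLE that holds only when all inspections share one sample size n. With unequal sample sizes, that line would be replaced by a numerical fit.

```r
# Simulated inspections (assumption): 1000 binary results, n = 44 plants each,
# true infestation rate 0.05
set.seed(3)
n <- 44
r <- rbinom(1000, size = 1, prob = 1 - (1 - 0.05)^n)

# MLE of the rate for a common n (closed form)
mle.rate <- function(x) 1 - (1 - mean(x))^(1 / n)

N <- length(r)
theta.hat <- mle.rate(r)

# Leave-one-out estimates: refit with observation i removed
theta.loo <- vapply(seq_len(N), function(i) mle.rate(r[-i]), numeric(1))

# Jackknife variance: (N - 1) / N times the sum of squared deviations
jack.var <- (N - 1) / N * sum((theta.loo - mean(theta.loo))^2)
jack.se <- sqrt(jack.var)

# Normal-approximation interval around the full-sample estimate
jack.ci <- theta.hat + c(-1, 1) * qnorm(0.975) * jack.se
c(estimate = theta.hat, se = jack.se)
jack.ci
```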
