I've coded my own Poisson likelihood function, but it is returning values that are significantly different from glm for a model with an interaction for a specific data. Notice that the function spits out exactly the same result as glm from all other data I've tried, as well as for the model without the interaction for this data.
> # Log likelihood function
> llpoi = function(X, y){
+ # Ensures X is a matrix
+ if(class(X) != "matrix") X = as.matrix(X)
+ # Ensures there's a constant
+ if(sum(X[, 1]) != nrow(X)) X = cbind(1, X)
+ # A useful scalar that I'll need below
+ k = ncol(X)
+ ## Function to be maximized
+ FUN = function(par, X, y){
+ # beta hat -- the parameter we're trying to estimate
+ betahat = par[1:k]
+ # mu hat -- the systematic component
+ muhat = X %*% betahat
+ # Log likelihood function
+ sum(muhat * y - exp(muhat))
+ }
+ # Optimizing
+ opt = optim(rep(0, k), fn = FUN, y = y, X = X, control = list(fnscale = -1), method = "BFGS", hessian = T)
+ # Results, including getting the SEs from the hessian
+ cbind(opt$par, sqrt(diag(solve(-1 * opt$hessian))))
+ }
>
> # Defining inputs
> y = c(2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 3, 2, 2, 2, 3, 1, 2, 4, 3, 3, 3, 1, 3, 0, 2, 1, 2, 4, 1, 2, 0, 2, 1, 2, 1, 4, 1, 2, 0)
> x1 = c(8, 1, 0, 3, 3, 3, 5, 4, 0.4, 1.5, 2, 1, 1, 7, 2, 3, 0, 2, 1.5, 5, 1, 4, 5.5, 6, 3, 3, 2, 0.5, 5, 10, 3, 22, 20, 3, 20, 10, 15, 25, 15, 6, 3.5, 5, 18, 2, 15.0, 16, 24)
> x2 = c(12, 12, 12, 16, 12, 12, 12, 12, 12, 12, 12, 12, 9, 9, 12, 9, 12, 12, 9, 16, 9, 6, 12, 9, 9, 12, 12, 12, 12, 14, 14, 14, 9, 12, 9, 12, 3, 12, 9, 6, 12, 12, 12, 12, 12, 12, 9)
>
> # Results
> withmyfun = llpoi(cbind(x1, x2, x1 * x2), y)
> round(withmyfun, 2)
[,1] [,2]
[1,] 0.96 0.90
[2,] -0.05 0.09
[3,] -0.02 0.08
[4,] 0.00 0.01
> withglm = glm(y ~ x1 + x2 + x1 * x2, family = "poisson")
> round(summary(withglm)$coef[, 1:2], 2)
Estimate Std. Error
(Intercept) 1.08 0.90
x1 -0.07 0.09
x2 -0.03 0.08
x1:x2 0.00 0.01
Is this something data specific? Is it inherent to the optimization process, which will eventually diverge more significantly from glm and I just got unlucky with this data? Is it a function of using method = "BFGS" for optim?
By rescaling the right-hand side variables, the outcome improves a lot.
> library(data.table)
> setDT(tmp)
> tmp[, x1 := scale(x1)][, x2 := scale(x2)]
>
>
> withmyfun = with(tmp, llpoi(cbind(x1, x2, x1 * x2), y))
> withmyfun
[,1] [,2]
[1,] 0.57076392 0.1124637
[2,] -0.19620040 0.1278070
[3,] -0.01509032 0.1169019
[4,] 0.05636459 0.1380611
>
> withglm = glm(y ~ x1 + x2 + x1 * x2, family = "poisson", data = tmp)
> summary(withglm)$coef[, 1:2]
Estimate Std. Error
(Intercept) 0.57075132 0.1124641
x1 -0.19618199 0.1278061
x2 -0.01507467 0.1169034
x1:x2 0.05636934 0.1380621
>
So, my recommendation is, inside llpoi
, to have a procedure to normalize the variables before using optim
to the data and rescale the estimates based before the function returns the value. Your example data have too big range, which results in very small estimates of coefficients. This problem gets worse because of the relatively flat likelihood surface because of insignificant variables.
Note:
You can get very close outputs from this except for the intercept. What I meant by standardizing is something like that.
llpoi = function(X, y){
# Ensures X is a matrix
if(class(X) != "matrix") X = as.matrix(X)
# Ensures there's a constant
if(sum(X[, 1]) != nrow(X)) X = cbind(1, X)
# A useful scalar that I'll need below
avgs <- c(0, apply(X[, 2:ncol(X)], 2, mean))
sds <- c(1, apply(X[, 2:ncol(X)], 2, sd))
X<- t((t(X) - avgs)/sds)
k = ncol(X)
## Function to be maximized
FUN = function(par, X, y){
# beta hat -- the parameter we're trying to estimate
betahat = par[1:k]
# mu hat -- the systematic component
muhat = X %*% betahat
# Log likelihood function
sum(muhat * y - exp(muhat))
}
# Optimizing
opt = optim(rep(0, k), fn = FUN, y = y, X = X, control = list(fnscale = -1), method = "BFGS", hessian = T)
# Results, including getting the SEs from the hessian
cbind(opt$par, sqrt(diag(solve(-1 * opt$hessian))))/sds
}
After much research, I learned that the two results differ because glm.fit, the workhorse behind glm optimizes the function through Newton-Raphson method, while I used BFGS in my llpoi function. BFGS is faster, but less precise. The two results will be very similar on most cases, but may differ more significantly when the surface area is too flat or has too many maxima, as correctly pointed out by amatsuo_net, because the climbing algorithm used by BFGS will get stuck.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.