I have a dataset of baseball statistics with one column for at-bats and one for hits. My goal is to estimate the alpha and beta parameters of the beta distribution by maximum likelihood estimation (MLE).
mlf1 <- function(alpha, beta) {
  # beta-binomial: uses both the successes and the number of trials
  -sum(dbetabinom.ab(data$hits, data$atbats, alpha, beta, log = TRUE))
}
mlf2 <- function(alpha, beta) {
  # plain beta: operates on the observed proportions only
  -sum(dbeta(data$hits / data$atbats, alpha, beta, log = TRUE))
}
So mlfX is the function that calculates the negative log-likelihood. mlf1 uses the beta-binomial distribution, meaning you pass the successes (data$hits) and the total observations (data$atbats); mlf2 uses the plain beta distribution and operates on the proportion of those two columns. They should essentially yield the same result.
I can execute the following without problems:
mle(mlf1, start = list(alpha = 1, beta = 10), method = "L-BFGS-B")
It yields alpha ~ 74 and beta ~ 222.
If I execute mle with the second negative log-likelihood function:
mle(mlf2, start = list(alpha = 1, beta = 10), method = "L-BFGS-B")
It gives me this:
Error in optim(start, f, method = method, hessian = TRUE, ...) :
L-BFGS-B needs finite values of 'fn'
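For what it's worth, a plausible source of the non-finite values (a sketch, assuming the data contain players with 0 hits or with hits equal to at-bats) is that dbeta returns a log-density of -Inf for proportions of exactly 0 or 1 once the corresponding shape parameter exceeds 1:

```r
# Boundary proportions make the beta log-density non-finite:
# a player with 0 hits has proportion 0; hits == atbats gives proportion 1.
dbeta(0, shape1 = 74, shape2 = 222, log = TRUE)  # -Inf (density is 0 at x = 0)
dbeta(1, shape1 = 74, shape2 = 222, log = TRUE)  # -Inf (density is 0 at x = 1)
```

The beta-binomial has no such problem, since 0 successes out of n trials is an ordinary outcome with positive probability. Filtering to atbats > 30 presumably drops exactly those boundary cases, which would be why the modified function works.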
If I modify mlf2 to keep only players with more than 30 at-bats, it starts to work.
mlf2modified <- function(alpha, beta) {
  # keep only players with more than 30 at-bats (dplyr::filter)
  data <- filter(data, atbats > 30)
  -sum(dbeta(data$hits / data$atbats, alpha, beta, log = TRUE))
}
My question is: why do these two basically identical approaches make the optimizer behave completely differently? And what can you do to avoid this if you only have proportions and do NOT want to throw out data points just because the optimizer is acting up?
UPDATE: dbetabinom.ab is from the VGAM package, mle is from stats4, and dbeta is from stats.
As remarked by others, you are comparing two different approaches to estimating the rates. When you use the beta distribution directly on y/n, you treat each rate as providing as much information as any other rate. When you use the beta-binomial, you use the information in both y and n; i.e., 1 success out of 2 trials provides much weaker evidence that the underlying rate is 50% than 100 out of 200 would.
The simplest way to estimate the rate would be the plain binomial distribution, but either because you are being Bayesian about it, or because you think the observations have more variance than the binomial allows (justifying the extra dispersion parameter), you end up with the beta-binomial distribution. Hence it is no surprise that you get different results when your n's are not all equal.
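The point about unequal information can be sketched with plain binomial log-likelihoods (stats::dbinom, no beta-binomial machinery needed): 1 hit in 2 at-bats barely distinguishes a .500 hitter from a .900 hitter, while 100 hits in 200 at-bats, the same observed proportion, distinguishes them overwhelmingly:

```r
# Log-likelihood ratio of rate p = 0.5 vs p = 0.9 for the same proportion
ll_small <- dbinom(1, size = 2, prob = 0.5, log = TRUE) -
            dbinom(1, size = 2, prob = 0.9, log = TRUE)
ll_large <- dbinom(100, size = 200, prob = 0.5, log = TRUE) -
            dbinom(100, size = 200, prob = 0.9, log = TRUE)
ll_small  # about 1.02: weak preference for p = 0.5 over p = 0.9
ll_large  # about 102:  overwhelming preference, same observed proportion
```

A beta-only likelihood on the proportions throws this distinction away, which is why the two fits can differ substantially when the n's vary.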