
Maximum likelihood estimation works with beta-binomial distribution but fails with beta distribution on same dataset

I have a dataset of baseball statistics. There's one column for at-bats and one for hits.

My goal is to estimate the alpha and beta parameters of the beta distribution using maximum likelihood estimation (MLE).

mlf1 = function (alpha, beta) {
    -sum(dbetabinom.ab(data$hits, data$atbats, alpha, beta, log=T))
}

mlf2 = function (alpha, beta) {
    -sum(dbeta(data$hits/data$atbats, alpha, beta, log=T))
}

So each mlfX computes a negative log-likelihood. mlf1 uses the beta-binomial distribution, which means you pass it the successes (data$hits) and the total observations (data$atbats). mlf2 uses the plain beta distribution and operates on the proportion of those two columns. They should essentially yield the same result.

I can execute the following without problems:

mle(mlf1, start=list(alpha=1, beta=10), method="L-BFGS-B")

It yields alpha ~ 74 and beta ~ 222
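For reference, here is a self-contained sketch of that working fit. The original data frame is not shown, so the atbats and hits columns below are simulated stand-ins, and the lower bound is only there to keep the shape parameters positive during optimization:

library(VGAM)    # dbetabinom.ab
library(stats4)  # mle

set.seed(1)
atbats <- rnbinom(200, size = 2, mu = 150) + 1     # hypothetical at-bat counts
p      <- rbeta(200, 74, 222)                      # hypothetical per-player rates
data   <- data.frame(atbats = atbats,
                     hits   = rbinom(200, size = atbats, prob = p))

mlf1 <- function(alpha, beta) {
    -sum(dbetabinom.ab(data$hits, data$atbats, alpha, beta, log = TRUE))
}

fit1 <- mle(mlf1, start = list(alpha = 1, beta = 10),
            method = "L-BFGS-B", lower = c(1e-6, 1e-6))
coef(fit1)   # estimated alpha and beta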

If I execute mle with the second negative log-likelihood method:

mle(mlf2, start=list(alpha=1, beta=10), method="L-BFGS-B")

It gives me this:

Error in optim(start, f, method = method, hessian = TRUE, ...) : 
  L-BFGS-B needs finite values of 'fn'
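That message means the negative log-likelihood returned a non-finite value for some parameter values. A plausible source, assuming the data contains players whose proportion is exactly 0 or 1 (common with very few at-bats), is that dbeta(..., log=TRUE) is -Inf at those endpoints:

p <- data$hits / data$atbats

# dbeta(x, alpha, beta, log = TRUE) is -Inf at x = 1 whenever beta > 1,
# and at x = 0 whenever alpha > 1, so one such player makes the whole
# negative log-likelihood non-finite.
dbeta(1, 1, 10, log = TRUE)   # -Inf
dbeta(0, 2, 10, log = TRUE)   # -Inf

sum(p == 0 | p == 1)          # how many such players there are
mlf2(2, 10)                   # Inf if any proportion is exactly 0 or 1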

If I modify mlf2 to keep only players with more than 30 at-bats, it starts to work.

mlf2modified = function (alpha, beta) {
    data = filter(data, atbats > 30)
    -sum(dbeta(data$hits/data$atbats, alpha, beta, log=T))
}

My question is: why do these two basically identical approaches make the optimizer behave so differently? And what can you do to avoid this if you only have proportions and do NOT want to throw out data points just because the optimizer is acting up?

UPDATE:

dbetabinom.ab is from the VGAM package, mle is from stats4, and dbeta is from stats.

As remarked by others, you are comparing two different approaches to estimating the rates. When you apply the beta distribution directly to y/n, you treat each rate as providing as much information as any other rate. When you use the beta-binomial, you use the information in both y and n, i.e. 1 out of 2 provides less information about the underlying rate being equal to 50% than 100 out of 200 would.
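A small sketch of that point (the alpha and beta values are just the rough estimates quoted above): the beta density only sees the proportion, so 1 out of 2 and 100 out of 200 contribute identically, while the beta-binomial likelihood also sees n and treats them very differently:

library(VGAM)  # dbetabinom.ab

alpha <- 74; beta <- 222

# Identical on the proportion scale: the beta density cannot tell these apart.
dbeta(1/2,     alpha, beta, log = TRUE)
dbeta(100/200, alpha, beta, log = TRUE)

# The beta-binomial uses both y and n: 100 out of 200 is much stronger
# evidence that the underlying rate is near 50% than 1 out of 2 is.
dbetabinom.ab(1,   2,   alpha, beta, log = TRUE)
dbetabinom.ab(100, 200, alpha, beta, log = TRUE)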

The simplest way to estimate the rate would be to use the binomial distribution, but either because you are being Bayesian about it or because you think the observations have more variance than the binomial allows (justifying the use of an extra dispersion parameter), you end up with the beta-binomial distribution. Hence, it's no surprise that you get different results when your n's are not all equal.
