
Maximum likelihood estimation works with beta-binomial distribution but fails with beta distribution on same dataset

I have a dataset of baseball statistics. There is one column for at-bats and one for hits.

My goal is to estimate the alpha and beta parameters of the beta distribution by maximum likelihood estimation (MLE).

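library(VGAM)    # provides dbetabinom.ab
library(stats4)  # provides mle

# Negative log-likelihood under the beta-binomial model (uses hits and at-bats)
mlf1 = function (alpha, beta) {
    -sum(dbetabinom.ab(data$hits, data$atbats, alpha, beta, log=TRUE))
}

# Negative log-likelihood under the beta model (uses only the proportion hits/at-bats)
mlf2 = function (alpha, beta) {
    -sum(dbeta(data$hits/data$atbats, alpha, beta, log=TRUE))
}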

So mlfX is the function that calculates the negative log-likelihood. mlf1 uses the beta-binomial distribution, which means that you pass it the successes (data$hits) and the total number of trials (data$atbats). mlf2 uses the plain beta distribution and operates on the proportion of those two columns. They should essentially yield the same result.
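For reference, the quantity that mlf1 sums can be written out with base R alone. The sketch below is not VGAM's implementation; it uses the standard beta-binomial log-pmf, and the names dbetabinom_log, toy and nll_betabinom are hypothetical, as is the toy data.

```r
# Beta-binomial log-pmf written with base R only (a sketch, not VGAM's code):
# log P(y | n, a, b) = log choose(n, y) + log B(y + a, n - y + b) - log B(a, b)
dbetabinom_log <- function(y, n, a, b) {
  lchoose(n, y) + lbeta(y + a, n - y + b) - lbeta(a, b)
}

# Hypothetical toy data, NOT the baseball dataset from the question
toy <- data.frame(hits = c(0, 1, 30, 75), atbats = c(3, 1, 100, 300))

# Negative log-likelihood in the style of mlf1
nll_betabinom <- function(alpha, beta) {
  -sum(dbetabinom_log(toy$hits, toy$atbats, alpha, beta))
}

# Evaluate at the starting values used in the question
nll_start <- nll_betabinom(1, 10)

# Sanity check: the pmf summed over y = 0..10 for n = 10
pmf_total <- sum(exp(dbetabinom_log(0:10, 10, 2, 5)))
```

Note that this likelihood stays finite even for rows such as 0 hits in 3 at-bats or 1 hit in 1 at-bat, because the counts enter through the beta function rather than through a density evaluated at 0 or 1.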

I can execute the following without problems:

mle(mlf1, start=list(alpha=1, beta=10), method="L-BFGS-B")

It yields alpha ~ 74 and beta ~ 222.

If I execute mle with the second negative log-likelihood function:

mle(mlf2, start=list(alpha=1, beta=10), method="L-BFGS-B")

It gives me this:

Error in optim(start, f, method = method, hessian = TRUE, ...) : 
  L-BFGS-B needs finite values of 'fn'

If I modify mlf2 to keep only players with more than 30 at-bats, it starts to work.

# filter() here is presumably dplyr::filter; keeps only rows with atbats > 30
mlf2modified = function (alpha, beta) {
    data = filter(data, atbats > 30)
    -sum(dbeta(data$hits/data$atbats, alpha, beta, log=TRUE))
}

My question is: why do these two basically identical approaches make the optimizer behave completely differently? What can you do to avoid this if you only have proportions and do NOT want to throw out data points because the optimizer is acting up?

UPDATE:

dbetabinom.ab is from the VGAM package, mle is from stats4, and dbeta is from stats.

As others have remarked, you are comparing two different approaches to estimating the rates. When you use the beta distribution directly on y/n, you treat each rate as providing as much information as any other rate. When you use the beta-binomial, you use the information in both y and n; i.e., 1 out of 2 provides less information about the underlying rate being equal to 50% than 100 out of 200 would.
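A small sketch of the 1-out-of-2 versus 100-out-of-200 point: the two hypothetical players below have the same proportion, so the beta log-density cannot tell them apart, while the binomial standard error of the rate differs by a factor of 10. The shape values 2 and 2 are arbitrary and chosen only for illustration.

```r
# Two hypothetical players with the same batting proportion
y <- c(1, 100)
n <- c(2, 200)

# The beta log-density sees only y / n, so both players contribute identically
beta_ll <- dbeta(y / n, shape1 = 2, shape2 = 2, log = TRUE)

# The binomial standard error of the estimated rate shrinks as n grows
p_hat <- y / n
se <- sqrt(p_hat * (1 - p_hat) / n)

# Ratio of the two standard errors: sqrt(200 / 2) = 10
se_ratio <- se[1] / se[2]
```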

The simplest way to estimate the rate would be to use the binomial distribution, but either because you are being Bayesian about it or because you think the observations have more variance than the binomial allows (justifying the extra dispersion parameter), you end up with the beta-binomial distribution. Hence, it's no surprise that you get different results if your n's are not all equal.
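As for the "finite values of 'fn'" error itself, one likely cause is a proportion of exactly 0 or 1. This is an assumption, since the question does not show the data, but it is consistent with the fit working once players with few at-bats are dropped. The sketch below uses made-up proportions to show how dbeta behaves on the boundary of (0, 1).

```r
# Hypothetical proportions: 0 (e.g. 0-for-3), an interior value, and 1 (e.g. 1-for-1)
props <- c(0, 0.25, 1)

# Log-density at the question's starting values alpha = 1, beta = 10.
# At x = 1 the factor (1 - x)^(beta - 1) is 0, so the log-density is -Inf.
ll_start <- dbeta(props, shape1 = 1, shape2 = 10, log = TRUE)

# With alpha > 1 the factor x^(alpha - 1) is also 0 at x = 0
ll_other <- dbeta(props, shape1 = 2, shape2 = 10, log = TRUE)

# A single boundary point makes the negative log-likelihood infinite,
# which is the kind of non-finite value L-BFGS-B refuses to work with
nll <- -sum(ll_start)
```

The beta-binomial likelihood does not have this problem, because 0 hits or all hits are ordinary outcomes with positive probability under that model.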
