
How do I efficiently estimate a probability based on a small amount of evidence?

I've been trying to find an answer to this for months (to be used in a machine learning application). It doesn't seem like it should be a terribly hard problem, but I'm a software engineer, and math was never one of my strengths.

Here is the scenario:

I have a (possibly) unevenly weighted coin and I want to figure out the probability of it coming up heads. I know that coins from the same box that this one came from have an average probability of p, and I also know the standard deviation of these probabilities (call it s).

(If other summary properties of the probabilities of other coins aside from their mean and stddev would be useful, I can probably get them too.)

I toss the coin n times, and it comes up heads h times.

The naive approach is to say the probability is just h/n, but if n is small this is unlikely to be accurate.

Is there a computationally efficient way (i.e., one that doesn't involve very, very large or very, very small numbers) to take p and s into consideration and come up with a more accurate probability estimate, even when n is small?

I'd appreciate it if any answers could use pseudocode rather than mathematical notation since I find most mathematical notation to be impenetrable ;-)


Other answers: There are some other answers on SO that are similar, but the answers provided are unsatisfactory. For example, this one is not computationally efficient, because it quickly involves numbers far smaller than can be represented even in double-precision floats. And this one turned out to be incorrect.

Unfortunately you can't do machine learning without knowing some basic math. It's like asking somebody for help with programming but not wanting to know about "variables", "subroutines", and all that if-then stuff.

The better way to do this is called Bayesian integration, but there is a simpler approximation called "maximum a posteriori" (MAP). It's pretty much like the usual thinking, except you can put in the prior distribution.

Fancy words, but you may ask: well, where did the h/(h+t) formula come from? Of course it's obvious, but it turns out that it is the answer you get when you have "no prior". The method below is the next level of sophistication up, where you add a prior. Going to Bayesian integration would be the next level after that, but that's harder and perhaps unnecessary.

As I understand it, the problem is twofold: first you draw a coin from the bag of coins. This coin has a "headsiness" called theta, so that it comes up heads for a fraction theta of the flips. The theta for this coin comes from the master distribution, which I'll assume is Gaussian with mean P and standard deviation S.
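To make that generative model concrete, here is a minimal Python sketch of my reading of it (the clipping of theta to [0, 1] is my addition, since a raw Gaussian draw can fall outside the valid range for a probability):

    import random

    def simulate_coin(P, S, n):
        # Draw one coin's "headsiness" theta from the Gaussian master
        # distribution, then flip that coin n times and count heads.
        theta = min(max(random.gauss(P, S), 0.0), 1.0)
        heads = sum(1 for _ in range(n) if random.random() < theta)
        return theta, heads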

What you do next is write down the total unnormalized probability (called the likelihood) of seeing the whole shebang, all the data: h heads and t tails.

L = (theta)^h * (1-theta)^t * Gaussian(theta; P, S).

Gaussian(theta; P, S) = exp( -(theta-P)^2/(2*S^2) ) / sqrt(2*Pi*S^2)

This is the meaning of "first draw one value of theta from the Gaussian, then draw h heads and t tails from a coin using that theta".
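As a sketch, this is what L looks like in code. Note that theta**h underflows for large h; that is exactly the numerical problem the question complains about, and it is why the derivation below switches to logarithms:

    import math

    def unnormalized_posterior(theta, h, t, P, S):
        # L = theta^h * (1-theta)^t * Gaussian(theta; P, S)
        # Warning: theta**h and (1-theta)**t underflow for large h, t.
        gaussian = (math.exp(-(theta - P) ** 2 / (2 * S ** 2))
                    / math.sqrt(2 * math.pi * S ** 2))
        return theta ** h * (1 - theta) ** t * gaussian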

The MAP principle says: if you don't know theta, find the value which maximizes L given the data that you do know. You do that with calculus. The trick to make it easy is that you take logarithms first. Define LL = log(L). Wherever L is maximized, LL will be too.

so LL = h*log(theta) + t*log(1-theta) - (theta-P)^2/(2*S^2) - 1/2 * log(2*pi*S^2)
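In code, a sketch of this might look like (dropping the constant last term, since it doesn't depend on theta):

    import math

    def log_likelihood(theta, h, t, P, S):
        # LL = h*log(theta) + t*log(1-theta) - (theta-P)^2/(2*S^2),
        # omitting the constant -1/2*log(2*pi*S^2) term.
        # Requires 0 < theta < 1.
        return (h * math.log(theta) + t * math.log(1 - theta)
                - (theta - P) ** 2 / (2 * S ** 2))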

Using calculus to look for extrema, you find the value of theta such that dLL/dtheta = 0. Since the last term (the one with the log) has no theta in it, you can ignore it.

dLL/dtheta = h/theta - t/(1-theta) + (P-theta)/S^2 = 0

If you can solve this equation for theta you will get an answer, the MAP estimate for theta given the number of heads h and the number of tails t.

If you want a fast approximation, try doing one step of Newton's method, where you start your proposed theta at the obvious estimate (called the maximum-likelihood estimate), theta = h/(h+t).

And where does that 'obvious' estimate come from? If you do the stuff above but don't put in the Gaussian prior, you get h/theta - t/(1-theta) = 0, which gives theta = h/(h+t).
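Here is a rough sketch of that Newton iteration for the Gaussian-prior model above. The clamping of theta away from 0 and 1 is my addition, to avoid division by zero at the endpoints:

    def map_estimate(h, t, P, S, steps=5):
        # Newton's method on dLL/dtheta = 0, starting from the
        # maximum-likelihood estimate h/(h+t). One step is often
        # enough; a few more cost little.
        eps = 1e-9
        theta = h / (h + t) if h + t > 0 else P
        theta = min(max(theta, eps), 1 - eps)
        for _ in range(steps):
            f = h / theta - t / (1 - theta) + (P - theta) / S ** 2
            fprime = -h / theta ** 2 - t / (1 - theta) ** 2 - 1 / S ** 2
            theta = min(max(theta - f / fprime, eps), 1 - eps)
        return theta

For example, with P = 0.5, S = 0.1, and 3 heads out of 4 tosses, this pulls the naive estimate of 0.75 back to roughly 0.53, since four tosses are weak evidence against a fairly tight prior.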

If your prior probabilities are really small, as is often the case, instead of near 0.5, then a Gaussian prior on theta is probably inappropriate, as it puts some weight on negative probabilities, which is clearly wrong. More appropriate is a Gaussian prior on log(theta) (a 'lognormal distribution'). Plug it in the same way and work through the calculus.

You don't have nearly enough info in this question.

How many coins are in the box? If it's two, then in some scenarios (for example, one coin is always heads, the other always tails) knowing p and s would be useful. If it's more than a few, and especially if some of the coins are only slightly weighted, then it is not useful.

What is a small n? 2? 5? 10? 100? What is the probability of a weighted coin coming up heads/tails? 100/0, 60/40, 50.00001/49.99999? How is the weighting distributed? Is every coin one of two possible weightings? Do they follow a bell curve? Etc.

It boils down to this: the differences between a weighted and an unweighted coin, the distribution of weighted coins, and the number of coins in your box will all decide what n has to be for you to solve this with high confidence.

The name for what you're trying to do is a Bernoulli trial. Knowing the name should be helpful in finding better resources.


Response to comment:

If you have differences in p that small, you are going to have to do a lot of trials and there's no getting around it.

Assuming a uniform distribution of bias, p will still be 0.5, and all the standard deviation will tell you is that at least some of the coins have a minor bias.

How many tosses you need will, again, be determined under these circumstances by the weighting of the coins. Even with 500 tosses, you won't get strong confidence (only about 2/3) detecting a .51/.49 split.
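If you want to sanity-check that 2/3 figure, a quick simulation along these lines should land near it (the 0.51 bias and n = 500 are just the numbers from the paragraph above):

    import random

    def detection_rate(bias, n, trials=10000):
        # Fraction of simulated experiments in which a coin with the
        # given bias shows strictly more heads than tails in n tosses.
        wins = 0
        for _ in range(trials):
            heads = sum(1 for _ in range(n) if random.random() < bias)
            if heads > n - heads:
                wins += 1
        return wins / trials

    # detection_rate(0.51, 500) comes out around 0.66, i.e. about 2/3.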

You can use p as a prior on your estimated probability. This is basically the same as doing pseudocount smoothing. That is, use

(h + c * p) / (n + c)

as your estimate. When h and n are large, this just becomes h/n. When h and n are small, it is just c*p/c = p. The choice of c is up to you. You can base it on s, but in the end you have to decide how small is too small.
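As a sketch, the whole method is one line:

    def smoothed_estimate(h, n, p, c):
        # Pseudocount smoothing: behaves as if you had seen c extra
        # tosses that came up heads a fraction p of the time.
        return (h + c * p) / (n + c)

    # e.g. smoothed_estimate(3, 4, 0.5, 5) is about 0.611, versus the
    # naive 3/4 = 0.75.

One way you might base c on s (an assumption on my part, from moment-matching a Beta prior with mean p and standard deviation s) is c = p*(1-p)/s^2 - 1.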

In general, what you are looking for is Maximum Likelihood Estimation. The Wolfram Demonstrations Project has an illustration of estimating the probability of a coin landing heads, given a sample of tosses.

Well, I'm no math man, but I think the simple Bayesian approach is intuitive and broadly applicable enough to put a little thought into it. Others above have already suggested this, but perhaps if you're like me you would prefer more verbosity. In this lingo, you have a set of mutually exclusive hypotheses, H, and some data D, and you want to find the (posterior) probability that each hypothesis Hi is correct given the data. Presumably you would choose the hypothesis that has the largest posterior probability (the MAP, as noted above), if you had to choose one. As Matt notes above, what distinguishes the Bayesian approach from maximum likelihood alone (finding the H that maximizes Pr(D|H)) is that you also have some PRIOR info regarding which hypotheses are most likely, and you want to incorporate these priors.

So from basic probability you have Pr(H|D) = Pr(D|H)*Pr(H)/Pr(D). You can estimate these Pr(Hi|D) numerically by creating a series of discrete probabilities Hi for each hypothesis you wish to test, e.g. [0.0, 0.05, 0.1, ..., 0.95, 1.0], and then determining your prior Pr(Hi) for each Hi (above it is assumed you have a normal distribution of priors; if that is acceptable, you could use the mean and stdev to get each Pr(Hi), or use another distribution if you prefer). With coin tosses, Pr(D|H) is of course determined by the binomial, using the observed number of successes in n trials and the particular Hi being tested. The denominator Pr(D) may seem daunting, but we assume that we have covered all the bases with our hypotheses, so Pr(D) is the summation of Pr(D|Hi)*Pr(Hi) over all Hi. A sketch of this grid computation follows below.
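Here is a minimal sketch of that computation, working in log space to sidestep the underflow the question worries about. The grid size of 101 points is an arbitrary choice, and constant factors (the binomial coefficient, the Gaussian normalizer) cancel in the normalization, so they are omitted:

    import math

    def posterior_grid(h, t, P, S, num_points=101):
        # Discretize theta, weight each grid value by
        # (binomial likelihood) * (Gaussian prior), then normalize.
        thetas = [i / (num_points - 1) for i in range(num_points)]
        log_w = []
        for theta in thetas:
            if (theta == 0.0 and h > 0) or (theta == 1.0 and t > 0):
                log_w.append(float("-inf"))  # zero posterior weight
                continue
            log_w.append((h * math.log(theta) if h else 0.0)
                         + (t * math.log(1 - theta) if t else 0.0)
                         - (theta - P) ** 2 / (2 * S ** 2))
        # Subtract the max before exponentiating to avoid underflow.
        m = max(log_w)
        w = [math.exp(x - m) for x in log_w]
        total = sum(w)
        return thetas, [x / total for x in w]

From the returned grid you can take the theta with the largest weight (the MAP) or the weighted average of the thetas (the posterior mean).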

Very simple if you think about it a bit, and maybe not so if you think about it a bit more.
