
Estimating parameters of binomial distribution to use as machine learning features

I'm working with genetic data in which each allele was observed n times among t sequenced chromosomes. In other words, n successes in t trials.

I want to include an estimate of each allele's frequency as a feature in a machine learning algorithm. I can of course get a point estimate with n/t, but I also want to represent the confidence in that point estimate -- i.e. something about how likely that estimate is.

Now, I believe the negative binomial (or just binomial) distribution would be the right one to use, but

  1. How can I estimate the parameters of the distribution in Python?
  2. What representation of the distribution would be ideal as a feature for classical (non-NN) machine learning? A conservative estimate might be the 95% CI upper bound, but how would I calculate that, and is there a better way to featurize the distribution than just taking that one value?

Thanks!

I think all of the information you need can be calculated by means of standard statistical methods, without applying machine learning.

  1. The maximum likelihood estimate of the parameter p of your binomial distribution Bin(t, p) is just n/t, as you correctly suggested. If you want a confidence interval instead of a point estimate, one way to get it is the Wald method:

    $$\hat{p} \pm z \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{t}}$$

    where \hat{p} = n/t and z is the 1 - α/2 quantile of the standard normal distribution. More options, depending on your modelling assumptions, can be found via the following link: Binomial confidence intervals.

  2. The 95% CI for p can be calculated as indicated above with z = 1.96.

  3. As for feature engineering for the machine learning algorithm: since t is given, your parametric distribution depends on only one estimated parameter, p, so the estimate of p identifies the distribution uniquely and you can use it directly as a feature. You can of course also add the CI or the variance as additional features. Everything depends on what exactly you are trying to learn and what your final objective/criterion is.
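To make the three points above concrete, here is a minimal sketch of the Wald interval using only the Python standard library. The function names `wald_interval` and `allele_features` and the feature keys are my own illustrative choices, not part of any library, and clipping the bounds to [0, 1] is an extra assumption on top of the formula.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(n, t, alpha=0.05):
    """Wald confidence interval for a binomial proportion.

    n: number of times the allele was observed
    t: number of chromosomes sequenced (trials)
    """
    if t <= 0 or not 0 <= n <= t:
        raise ValueError("need t > 0 and 0 <= n <= t")
    p_hat = n / t                              # MLE of p
    z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    half_width = z * sqrt(p_hat * (1 - p_hat) / t)
    # Assumption: clip to [0, 1], since the raw Wald bounds can leave that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def allele_features(n, t, alpha=0.05):
    """Hypothetical feature set: point estimate, CI bounds and CI width."""
    lower, upper = wald_interval(n, t, alpha)
    return {
        "p_hat": n / t,
        "ci_lower": lower,
        "ci_upper": upper,
        "ci_width": upper - lower,
    }


print(allele_features(20, 100))
```

Note that when n = 0 or n = t the Wald interval collapses to zero width, so for rare alleles one of the other interval methods may be a better fit.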

Binoculars implements many methods for calculating binomial confidence intervals. (PS: I am the author of Binoculars.)

pip install binoculars

If N = (total chromosomes sequenced) and p = (number of times the allele is observed / N), i.e. t and n/t in the question's notation, you can estimate the confidence interval straightforwardly:

from binoculars import binomial_confidence

N, p = 100, 0.2

binomial_confidence(p, N)
# (0.1307892803998113, 0.28628125447599173)
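If installing a package is not an option, one of the alternative interval methods can be written by hand. The following is a minimal sketch of the Wilson score interval using only the standard library. The name `wilson_interval` is hypothetical, this is not Binoculars' implementation, and it is not meant to reproduce the numbers shown above.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(n, t, alpha=0.05):
    """Wilson score interval for n successes in t trials."""
    if t <= 0 or not 0 <= n <= t:
        raise ValueError("need t > 0 and 0 <= n <= t")
    p_hat = n / t
    z = NormalDist().inv_cdf(1 - alpha / 2)    # about 1.96 for alpha = 0.05
    denom = 1 + z**2 / t
    centre = (p_hat + z**2 / (2 * t)) / denom
    half_width = (z / denom) * sqrt(
        p_hat * (1 - p_hat) / t + z**2 / (4 * t**2)
    )
    # Clipping only guards against floating-point round-off at the edges.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# An allele never observed in 50 chromosomes still gets a non-trivial upper bound.
print(wilson_interval(0, 50))
print(wilson_interval(20, 100))
```

Unlike the Wald interval, this one does not collapse to zero width when n = 0, which may matter if the upper bound is used as a conservative feature for rare alleles.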


 