
Generate random variables from a probability distribution

I have extracted some variables from my Python dataset and I want to generate a larger dataset from the distributions I have. The problem is that I want to introduce some variability into the new dataset while maintaining similar behaviour. This is an example of my extracted data, which consists of 400 observations:

Value    Observation Count    Ratio of Entries
1        352                  0.88
2        28                   0.07
3        8                    0.02
4        4                    0.01
7        4                    0.01
13       4                    0.01

Now I am trying to use this information to generate a similar dataset with 2,000 observations. I am aware of the numpy.random.choice and random.choice functions, but I do not want to use the exact same distribution. Instead, I would like to generate random variables (the values column) based on the distribution, but with more variability. An example of how I would like my larger dataset to look:

Value         Observation Count        Ratio of Entries
1             1763                     0.8815
2             151                      0.0755
3             32                       0.0160
4             19                       0.0095
5             10                       0.0050
6             8                        0.0040
7             2                        0.0010
8             4                        0.0020
9             2                        0.0010
10            3                        0.0015
11            1                        0.0005
12            1                        0.0005
13            1                        0.0005
14            2                        0.0010
15            1                        0.0005

So the new distribution is something that could be estimated by fitting my original data with an exponential decay function; however, I am not interested in continuous variables. How do I get around this, and is there a particular statistical or mathematical method relevant to what I am trying to do?

It sounds like you want to generate data based on the PDF described in the second table. The PDF is something like

0 for x <= B
A*exp(-A*(x-B)) for x > B

A defines the width of your distribution, which is always normalized to have an area of 1. B is the horizontal offset, which is zero in your case. You can make it an integer distribution by binning with ceil.
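As a quick sanity check of the binning idea, here is a minimal sketch (with a hypothetical rate A, and B = 0 as in the question): the probability of each integer bin n is the integral of the PDF over (n-1, n], i.e. exp(-A*(n-1)) - exp(-A*n), and empirical ceil-binned samples should match it.

```python
import numpy as np
from scipy.stats import expon

# Hypothetical rate A (1/scale in scipy terms); B = 0 as in the question.
A = 1.2

# Draw continuous exponential samples and bin them with ceil.
samples = np.ceil(expon.rvs(scale=1 / A, size=100_000, random_state=42)).astype(int)
counts = np.bincount(samples, minlength=6)[1:6]
empirical = counts / len(samples)

# Integrating the PDF over (n-1, n] gives the probability of each integer bin.
n = np.arange(1, 6)
analytic = np.exp(-A * (n - 1)) - np.exp(-A * n)
```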

The CDF of a normalized decaying exponential is 1 - exp(-A*(x-B)). Generally, a simple way to sample a custom distribution is to generate uniform numbers and map them through the inverse of the CDF (inverse transform sampling).
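For reference, the manual inverse-transform route would look something like this minimal sketch (A and B are hypothetical example values, not fitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: rate A and offset B (zero, as in the answer).
A = 2.5
B = 0.0

# Invert the CDF u = 1 - exp(-A*(x - B)) to get x = B - ln(1 - u)/A,
# then map uniform draws through it (inverse transform sampling).
u = rng.random(2000)
x = B - np.log(1.0 - u) / A

# Bin to integers with ceil; guard the measure-zero u == 0 case.
values = np.maximum(np.ceil(x).astype(int), 1)
```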

Fortunately, you won't have to do that, since scipy.stats.expon already provides the implementation you are looking for. All you have to do is fit the data in your last column to get A (B is clearly zero). You can do this easily with scipy.optimize.curve_fit. Keep in mind that A maps to 1.0/scale in scipy's parametrization.

Here is some sample code. I've added an extra layer of complexity: for integer inputs n, the model computes the integral of the PDF from n-1 to n, so the binning is taken into account during the fit.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import expon

def model(x, a):
    # Probability that ceil(X) == x for X ~ expon(scale=1/a):
    # the integral of the PDF over (x - 1, x].
    return np.exp(-a * (x - 1)) - np.exp(-a * x)
    # Alternative:
    # return -np.diff(np.exp(-a * np.concatenate(([x[0] - 1], x))))

x = np.arange(1, 16)
p = np.array([0.8815, 0.0755, 0.0160, 0.0095, 0.0050, 0.0040,
              0.0010, 0.0020, 0.0010, 0.0015, 0.0005, 0.0005,
              0.0005, 0.0010, 0.0005])
(a,), _ = curve_fit(model, x, p, p0=[0.01])
samples = np.ceil(expon.rvs(scale=1 / a, size=2000)).astype(int)
samples[samples == 0] = 1
data = np.bincount(samples)[1:]

If you have an exponential decay, the underlying discrete probability distribution is a geometric distribution, the discrete counterpart of the continuous exponential distribution. A geometric distribution has a parameter p, the probability of success of a single trial (like a biased coin toss), and it describes the number of trials needed to get one success.

The mean of the distribution is 1/p, so we can use the mean of the observations to estimate p.

The distribution is available in scipy as scipy.stats.geom. To sample from it, use geom.rvs(estimated_p, size=2000).

Here is some code to demonstrate the approach:

from collections import defaultdict

from scipy.stats import geom

observation_index = [1, 2, 3, 4, 7, 13]
observation_count = [352, 28, 8, 4, 4, 4]

observed_mean = sum(i * c for i, c in zip(observation_index, observation_count)) / sum(observation_count)

estimated_p = 1 / observed_mean
print('observed_mean:', observed_mean)
print('estimated p:', estimated_p)

generated_values = geom.rvs(estimated_p, size=2000)
generated_dict = defaultdict(int)
for v in generated_values:
    generated_dict[v] += 1
generated_index = sorted(generated_dict.keys())
generated_count = [generated_dict[i] for i in generated_index]
print('new random sample:')
print(generated_index)
print(generated_count)

Output:

observed_mean: 1.32
estimated p: 0.7575757575757576
new random sample:
[1, 2, 3, 4, 5, 7]
[1516, 365, 86, 26, 6, 1]
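Note that the two answers are consistent with each other: binning an Exponential(A) variable with ceil yields exactly a geometric distribution with p = 1 - exp(-A). A quick check of the two PMFs, using a hypothetical rate A:

```python
import numpy as np
from scipy.stats import geom

A = 0.9                  # hypothetical exponential rate
p = 1 - np.exp(-A)       # matching geometric success probability

k = np.arange(1, 10)
geom_pmf = geom.pmf(k, p)                             # (1-p)**(k-1) * p
binned_pmf = np.exp(-A * (k - 1)) - np.exp(-A * k)    # P(ceil(X) == k)
```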
