
np.random.choice not producing expected histogram

I'm looking to generate random normally distributed numbers between 1 and 0, but as the mean moves closer to 1 or 0, the right or left side respectively becomes "squished".

After modifying the normal distribution and playing around with sliders in geogebra, I came up with the following:

[Image: equation]
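The equation itself was posted as an image. Reading it back from the clamped_normal_distribution function in the script further down (and leaving out the small constant that function adds to avoid division by zero), the curve appears to be the following. This is a reconstruction from the code, not a copy of the original image, and it uses \sigma for stddev:

```latex
f(x) =
\begin{cases}
\exp\!\left(-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^{2}\dfrac{1}{x^{2}}\right), & 0 \le x \le \mu \\[2ex]
\exp\!\left(-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^{2}\dfrac{1}{(1-x)^{2}}\right), & \mu < x \le 1 \\[2ex]
0, & \text{otherwise}
\end{cases}
```

Note that f is not normalized: its area over [0, 1] is not 1, which is why the script divides by the total area.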

mu=0.5, stddev = 0.8

mu=0.75, stddev = 0.8

Next I needed to create a method in python which would generate random samples that would be distributed according to this PDF.

Originally I thought the only way to do this was to try and derive a new equation for generating random numbers as seen in the Box-Muller proof (which I got by following along with this tutorial).

However, I thought there might be an easier way to do this by using the numpy library's np.random.choice() method.

After all, I should be able to integrate the PDF at a very small step size and get the various probabilities for said steps (approximately of course).
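As a rough sketch of that idea: evaluate the curve on a fine grid, take the trapezoid area of each small interval, and divide by the total so the pieces sum to 1. The helper name discretize_pdf and the stand-in curve bump are made up for illustration; they are not part of the script in this question.

```python
import numpy as np

def discretize_pdf(pdf, low, high, n_bins):
    """Approximate an unnormalized PDF by per-bin probabilities.

    Returns the bin centers and the probabilities (which sum to 1).
    """
    edges = np.linspace(low, high, n_bins + 1)
    heights = pdf(edges)
    # Trapezoid rule per bin: mean of the two edge heights times the bin width
    bin_areas = 0.5 * (heights[:-1] + heights[1:]) * np.diff(edges)
    probabilities = bin_areas / bin_areas.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, probabilities

# Stand-in curve for illustration only (hypothetical, not the clamped curve)
def bump(x):
    return np.exp(-0.5 * ((x - 0.7) / 0.1) ** 2)

centers, probabilities = discretize_pdf(bump, 0.0, 1.0, 1000)
print("sum of probabilities:", probabilities.sum())
print("most likely bin center:", centers[np.argmax(probabilities)])
```

Each probability belongs to an interval on the x-axis, so the bin centers (not the curve heights) are the values those probabilities describe.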

So with that I wrote the following script:

# Standard libs
import math

# Third party libs
import numpy as np

from alive_progress import alive_bar
from matplotlib import pyplot as plt

class RandomNumberGenerator:
    def __init__(self):
        pass

    def clamped_normal_distribution(self, mu: float, 
            stddev: float, x: float):
        """ Computes a value from the clamped normal distribution """
        divideByZeroAvoider = 1e-5
        if x < 0 or x > 1:
            return 0
        elif x >= 0 and x <= mu:
            return math.exp(-0.5*( (x - mu) / (stddev)  )**2 \
                    * (1/(x**2 + divideByZeroAvoider)))
        elif x <= 1 and x > mu:
            return math.exp(-0.5*( (x - mu) / (stddev)  )**2 \
                    * (1/((1-x)**2 + divideByZeroAvoider))) 
        else:
            print("This shouldn't happen!: {}".format(x))
            return 0

if __name__ == '__main__':
    rng = RandomNumberGenerator()

    mu = 0.7
    stddev = 1
    stepSize = 1e-3
    x = np.linspace(stepSize,1, int(1/stepSize) - 1)

    # Determine the total area under the curve
    samples = []
    print("Generating samples...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            samples.append(rng.clamped_normal_distribution(
                    mu, stddev, i))
            bar()

    area = np.trapz(samples, dx=stepSize)
    print("Area = {}".format(area))

    # Determine the probability of x falling in a specific interval
    probabilities = []

    print("Generating probabilities...")
    with alive_bar(len(x.tolist())) as bar:
        for i in x:
            lead = rng.clamped_normal_distribution(mu, 
                    stddev, i)
            lag = rng.clamped_normal_distribution(mu, 
                    stddev, i - stepSize)
            probability = np.trapz(
                    np.array([lag, lead]), 
                    dx=stepSize)
            
            # Divide by the area because this isn't a standard normal
            probabilities.append(probability / area)
            bar()
    
    # Should be approximately 1
    print("Probability: {}".format(sum(probabilities)))

    plt.plot(x, probabilities)
    plt.show()

    y = []
    print("Performing distribution test...")
    testSize = int(10e3)
    with alive_bar(testSize) as bar:
        for _ in range(testSize):
            randSamp = np.random.choice(samples, p=probabilities)
            y.append(randSamp)
            bar()

    plt.hist(y,300)
    plt.show()

The first plot of the probabilities against the linearly spaced samples looks promising, giving me the following graph:

[Image: plot of the probabilities against x]

However, if we use these samples as choices with given probabilities, we get the following histogram:

[Image: histogram of the values drawn with np.random.choice]

I have no idea why this isn't working correctly.

I've tried other (smaller) examples like the ones listed on the numpy website, and they produce histograms that match the given probabilities array.

I'd really appreciate some advice/intuition if at all possible:).

It looks like there is a problem with the first argument in the call np.random.choice(samples, p=probabilities). The first argument should be x, not samples.
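A minimal sketch of the corrected call, assuming x holds the grid positions and probabilities sums to 1. The arrays here are simplified stand-ins (a plain Gaussian bump, normalized by a simple sum) rather than the exact values the script computes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the arrays built in the script
x = np.linspace(0.001, 1.0, 999)                 # positions on the x-axis
samples = np.exp(-0.5 * ((x - 0.7) / 0.1) ** 2)  # curve heights at those positions
probabilities = samples / samples.sum()          # normalized so they sum to 1

# Choose among the x positions, weighted by the probabilities
y = rng.choice(x, size=10_000, p=probabilities)
print("mean of the drawn values:", y.mean())
```

Passing size draws all the values in one call instead of looping in Python. np.random.choice accepts the same size and p arguments, so the same change works without switching to default_rng.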

ADDITION BY AUTHOR:

The reason for this is that samples holds the values of the curve (i.e. the y-axis and NOT the x-axis).

Thus the entries with the highest probabilities (i.e. the samples around the mean) all have a value of ~1, which is why we see such a massive spike around the value 1.
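This can be checked numerically. The sketch below rebuilds the curve heights with a vectorized version of the piecewise formula from the script (mu = 0.7, stddev = 1), then draws from the heights the way the original call did. Normalizing by a simple sum is an assumption made here for brevity; the script uses trapezoid areas instead.

```python
import numpy as np

mu, stddev, eps = 0.7, 1.0, 1e-5
x = np.linspace(0.001, 1.0, 999)

# Curve heights: squash by x**2 left of the mean, (1 - x)**2 right of it
squash = np.where(x <= mu, x**2, (1.0 - x) ** 2) + eps
heights = np.exp(-0.5 * ((x - mu) / stddev) ** 2 / squash)
probabilities = heights / heights.sum()  # simple normalization (assumed)

rng = np.random.default_rng(0)
# Same mistake as the original call: choosing among heights, not x positions
drawn_heights = rng.choice(heights, size=10_000, p=probabilities)

print("largest curve height:", heights.max())
print("share of draws above 0.9:", np.mean(drawn_heights > 0.9))
```

Because tall points on the curve are also the likeliest to be picked, the drawn heights pile up near the top of the curve, which is close to 1 here.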

Changing this to x gives us the following graphs (for 10e3 samples):

[Images: four plots of the results after the change]

Working as expected, very nice.
