How to plot normal distribution curve along with Central Limit theorem

I am trying to plot a normal distribution curve over the distribution of my Central Limit Theorem data.

Below is the implementation I have tried.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math

# 10000 simulations of die rolls
n = 10000

avg = []
for i in range(n):  # repeat the experiment n times
    a = np.random.randint(1, 7, 10)  # roll a die 10 times (values 1 to 6)
    avg.append(np.average(a))  # store the mean of those 10 rolls

plt.hist(avg[0:])

zscore = stats.zscore(avg[0:])

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2)))

I get the graph below:

[image]

You can see the normal curve in red at the bottom.

Can anyone tell me why the curve does not fit?

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math

# 10000 simulations of die rolls
n = 10000

avg = []
for i in range(n):  # repeat the experiment n times
    a = np.random.randint(1, 7, 10)  # roll a die 10 times (values 1 to 6)
    avg.append(np.average(a))  # store the mean of those 10 rolls

# density=True replaces normed=True, which was removed in Matplotlib 3.1
plt.hist(avg[0:], 20, density=True)

zscore = stats.zscore(avg[0:])

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2)))

I have just normalised the histogram of the avg list as well, which scales it down to the same range as the curve.

Plot:

[image]

You almost had it! First, see that you're plotting two histograms on the same axes:

plt.hist(avg[0:])

and

plt.hist(s, 20, normed=True)

To plot the normal density over the histogram, you rightly normalised the second plot with the normed=True argument. However, you forgot to normalise the first histogram too (plt.hist(avg[0:], normed=True)). Note that normed was removed in Matplotlib 3.1; on current versions use density=True instead.

I'd also recommend that, since you've already imported scipy.stats, you may as well use the normal distribution that comes in that module rather than coding the pdf yourself.

Putting this all together we have:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# 10000 simulations of die rolls
n = 10000

avg = []
for i in range(n):
    a = np.random.randint(1, 7, 10)
    avg.append(np.average(a))

# CHANGED: normalise this histogram too
# (density=True replaces normed=True, which was removed in Matplotlib 3.1)
plt.hist(avg[0:], 20, density=True)

zscore = stats.zscore(avg[0:])

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)

# Use scipy.stats implementation of the normal pdf
# Plot the distribution curve
x = np.linspace(1.5, 5.5, num=100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))

Which gave me the following plot:

[image]

Edit

In the comments you asked:

  1. How did I choose 1.5 and 5.5 in np.linspace
  2. Is it possible to plot the normal kernel over the non-normalised histogram?

To address question 1 first: I chose 1.5 and 5.5 by eye. After plotting the histogram I saw that the bins appeared to range between 1.5 and 5.5, so that is the range over which we'd like to plot the normal distribution.

A more programmatic way of choosing this range would have been:

x = np.linspace(bins.min(), bins.max(), num=100)

As for question 2: yes, we can achieve what you want. However, you should know that we'd no longer be plotting a probability density function at all.

After removing the normed=True (or density=True) argument when plotting the histograms:

x = np.linspace(bins.min(), bins.max(), num=100)

# Find pdf of normal kernel at mu
max_density = stats.norm.pdf(mu, mu, sigma)
# Calculate how to scale pdf
scale = count.max() / max_density

plt.plot(x, scale * stats.norm.pdf(x, mu, sigma))

This gave me the following plot:

[image]
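As an alternative to matching the peak heights, the pdf can be scaled by the number of samples times the bin width, which gives the count a normal distribution would be expected to put in each bin. The following is only a sketch under assumptions that are not in the original answer: it uses np.histogram instead of plt.hist so that it runs without a display, the seed is arbitrary, and the name expected_counts is made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Means of 10 die rolls, repeated 10000 times (same setup as the question)
avg = rng.integers(1, 7, size=(10000, 10)).mean(axis=1)
mu, sigma = avg.mean(), avg.std()

# Raw counts per bin, no normalisation
count, bins = np.histogram(avg, bins=20)
bin_width = bins[1] - bins[0]

# Expected count near x is roughly N * bin_width * pdf(x)
x = np.linspace(bins.min(), bins.max(), num=100)
expected_counts = len(avg) * bin_width * stats.norm.pdf(x, mu, sigma)

# With matplotlib this could be drawn as:
#   plt.hist(avg, bins=bins)
#   plt.plot(x, expected_counts)
```

Unlike scaling by count.max(), this factor does not depend on how tall the highest bin happens to be, so one unusually full bin does not stretch the curve.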

The logic seems to be correct.

The problem is in how the data is displayed.

Try normalising the first histogram with normed=True (density=True in Matplotlib 3.1 and later) and using the same number of bins for both histograms, e.g. 20.
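One way to make sure both histograms really share the same bins, rather than just the same bin count, is to compute the bin edges once and reuse them. This is a minimal sketch, assuming NumPy 1.15 or later for np.histogram_bin_edges; the seed and the names edges, avg_density and s_density are illustrative and not from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Means of 10 die rolls, and normal samples with the same mean and std
avg = rng.integers(1, 7, size=(10000, 10)).mean(axis=1)
s = rng.normal(avg.mean(), avg.std(), size=10000)

# Compute 20 bins once, over the combined range of both data sets
edges = np.histogram_bin_edges(np.concatenate([avg, s]), bins=20)

# Normalise both histograms so that each integrates to 1
avg_density, _ = np.histogram(avg, bins=edges, density=True)
s_density, _ = np.histogram(s, bins=edges, density=True)

# With matplotlib the same edges can be passed to both calls:
#   plt.hist(avg, bins=edges, density=True)
#   plt.hist(s, bins=edges, density=True)
```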

Throwing a die is a case of a discrete uniform distribution: each number from 1 to 6 turns up with probability 1/6. So the mean and standard deviation are given by

mean = (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5
standard deviation = sqrt((6^2 - 1) / 12) = sqrt(35/12) ≈ 1.7078

Now, the CLT says that for a sufficiently large number of throws n (10 in the code), the distribution of the mean of the n throws approaches a normal distribution with mean 3.5 and standard deviation 1.7078/sqrt(10).

import math
import numpy as np
import scipy.stats as stats

# avg is the list of sample means produced by the simulation in the question
n_bins = 50
pdf_from_hist, bin_edges = np.histogram(np.array(avg), bins=n_bins, density=True)
bin_mid_pts = (bin_edges[:-1] + bin_edges[1:]) * 0.5
assert len(pdf_from_hist) == len(bin_mid_pts)

expected_std = 1.7078 / math.sqrt(10)
expected_mean = 3.5

pk_s = []
qk_s = []
for i in range(n_bins):
    p = stats.norm.pdf(bin_mid_pts[i], expected_mean, expected_std)
    q = pdf_from_hist[i]
    if q <= 1.0e-5:  # skip (near-)empty bins, which would make the divergence infinite
        continue
    pk_s.append(p)
    qk_s.append(q)

# Compute the KL divergence (stats.entropy normalises both inputs to sum to 1)
kl_div = stats.entropy(pk_s, qk_s)
print('The pdf of the mean of the 10 throws differs from the corresponding normal distribution with a KL divergence of %r' % kl_div)
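The values 3.5 and 1.7078/sqrt(10) used above can also be checked numerically. The sketch below is an illustration rather than part of the original answer: it computes the mean and population standard deviation of a fair die directly, then compares the CLT prediction with a simulation. The seed and variable names are assumptions.

```python
import numpy as np

faces = np.arange(1, 7)   # faces of a fair six-sided die
die_mean = faces.mean()   # 3.5
die_std = faces.std()     # population std, sqrt(35/12), about 1.7078

n_rolls = 10
clt_std = die_std / np.sqrt(n_rolls)  # predicted std of the mean of 10 rolls

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
avg = rng.integers(1, 7, size=(10000, n_rolls)).mean(axis=1)

print(f"theory:    mean={die_mean:.4f}, std={clt_std:.4f}")
print(f"simulated: mean={avg.mean():.4f}, std={avg.std():.4f}")
```

The simulated mean and standard deviation should land close to the theoretical ones, with the gap shrinking as the number of simulations grows.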
