I am trying to overlay a normal distribution curve on the distribution of my central limit theorem simulation data.
Below is the implementation I have tried.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math
# 1000 simulations of die roll
n = 10000
avg = []
for i in range(1,n):#roll dice 10 times for n times
    a = np.random.randint(1,7,10)#roll dice 10 times from 1 to 6 & capturing each event
    avg.append(np.average(a))#find average of those 10 times each time
plt.hist(avg[0:])
zscore = stats.zscore(avg[0:])
mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)
# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)
# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2)))
I get the graph below.
You can see the normal curve in red at the bottom.
Can anyone tell me why the curve is not fitting?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math
# 1000 simulations of die roll
n = 10000
avg = []
for i in range(1,n):#roll dice 10 times for n times
    a = np.random.randint(1,7,10)#roll dice 10 times from 1 to 6 & capturing each event
    avg.append(np.average(a))#find average of those 10 times each time
plt.hist(avg[0:],20,normed=True)
zscore = stats.zscore(avg[0:])
mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)
# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)
# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2)))
I have now normalised the histogram of the avg list as well.
Plot:
You almost had it! First, see that you're plotting two histograms on the same axes:
plt.hist(avg[0:])
and
plt.hist(s, 20, normed=True)
So that you can plot the normal density over the histogram, you rightly normalised the second plot with the normed=True argument. However, you forgot to normalise the first histogram too: plt.hist(avg[0:], normed=True). (normed was removed in Matplotlib 3.1; on current versions pass density=True instead.)
I'd also recommend that, since you've already imported scipy.stats, you use the normal distribution that comes in that module rather than coding the pdf yourself.
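As a quick sanity check that the two approaches agree, here is a minimal sketch comparing the hand-coded formula from the question with stats.norm.pdf. The values of mu and sigma are illustrative assumptions, roughly what the simulation produces, not output from your run.

```python
import numpy as np
from scipy import stats

# Illustrative values only (assumed, close to what the dice simulation gives)
mu, sigma = 3.5, 0.54
x = np.linspace(1.5, 5.5, num=100)

# Hand-coded normal density, as written in the question
manual_pdf = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# Same density from scipy.stats: loc is the mean, scale is the standard deviation
scipy_pdf = stats.norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(manual_pdf, scipy_pdf))
```

Both give the same curve, so the scipy.stats version is purely a readability choice.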
Putting this all together we have:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
# n simulations, each averaging 10 die rolls
n = 10000
avg = []
for i in range(n):
    a = np.random.randint(1, 7, 10)  # 10 rolls of a fair die (upper bound 7 is exclusive)
    avg.append(np.average(a))
# CHANGED: normalise this histogram too
# (density=True replaces normed=True, which was removed in Matplotlib 3.1)
plt.hist(avg[0:], 20, density=True)
zscore = stats.zscore(avg[0:])
mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)
# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)
# Plot the distribution curve using the scipy.stats implementation of the normal pdf
x = np.linspace(1.5, 5.5, num=100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
Which gave me the following plot:
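In the comments you asked two questions: how the arguments to np.linspace were chosen, and whether the curve can be drawn over the histograms without normalising them.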
To address question 1 first: I chose 1.5 and 5.5 by eye. After plotting the histogram I saw that the bins ranged roughly between 1.5 and 5.5, so that is the range over which we'd like to plot the normal distribution.
A more programmatic way of choosing this range would have been:
x = np.linspace(bins.min(), bins.max(), num=100)
As for question 2, yes, we can achieve what you want. However, you should know that we'd no longer be plotting a probability density function at all.
After removing the normed=True argument (density=True in current Matplotlib) when plotting the histograms:
x = np.linspace(bins.min(), bins.max(), num=100)
# Find pdf of normal kernel at mu
max_density = stats.norm.pdf(mu, mu, sigma)
# Calculate how to scale pdf
scale = count.max() / max_density
plt.plot(x, scale * stats.norm.pdf(x, mu, sigma))
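If you'd rather not match the peak by hand, another common scaling (an alternative to the snippet above, not what it does) multiplies the pdf by the number of samples times the bin width, which is the expected count per bin under the normal model. A minimal sketch, assuming equal-width bins; the seeded generator and the names counts, edges and expected_counts are my own, chosen for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
# 10000 simulations, each the average of 10 die rolls (high bound 7 is exclusive)
avg = rng.integers(1, 7, size=(10000, 10)).mean(axis=1)
mu, sigma = avg.mean(), avg.std()

# Raw counts, no normalisation
counts, edges = np.histogram(avg, bins=20)
bin_width = edges[1] - edges[0]

# Expected count per bin is roughly N * bin_width * pdf(x)
x = np.linspace(edges.min(), edges.max(), num=100)
expected_counts = len(avg) * bin_width * stats.norm.pdf(x, mu, sigma)
```

Calling plt.plot(x, expected_counts) on top of plt.hist(avg, 20) would then put the curve on the same count scale as the bars.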
The logic seems to be correct.
The problem is in how the data is displayed.
Try normalising the first histogram with normed=True
and using the same number of bins for both histograms, for example 20.
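One way to make sure both histograms really share the same bins is to compute the bin edges once and reuse them. This is only a sketch under my own assumptions: the seeded generator, the pooled-data edges from np.histogram_bin_edges and the variable names are not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
avg = rng.integers(1, 7, size=(10000, 10)).mean(axis=1)  # averages of 10 die rolls
s = rng.normal(avg.mean(), avg.std(), size=10000)        # matching normal samples

# 20 equal-width bins spanning both data sets
edges = np.histogram_bin_edges(np.concatenate([avg, s]), bins=20)

# Normalised heights on identical bins, so the two are directly comparable
dens_avg, _ = np.histogram(avg, bins=edges, density=True)
dens_s, _ = np.histogram(s, bins=edges, density=True)
```

The same edges array can be passed to plt.hist through its bins argument for each data set.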
The throw of a die follows a discrete uniform distribution: each number from 1 to 6 turns up with probability 1/6. So the mean and standard deviation are given by
$$\mu = \frac{1}{6}\sum_{k=1}^{6} k = 3.5, \qquad \sigma = \sqrt{\frac{1}{6}\sum_{k=1}^{6}(k-\mu)^2} = \sqrt{35/12} \approx 1.7078$$
Now, the CLT says that for a sufficiently large n (n is 10 in the code), the distribution of the mean of the n throws approaches a normal distribution with mean 3.5 and standard deviation 1.7078/sqrt(10), which is about 0.5401.
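The numbers quoted here can be checked with a few lines of plain Python. This is just a worked calculation for a fair six-sided die; the variable names are my own.

```python
import math

faces = range(1, 7)  # a fair die: 1..6, each with probability 1/6

mean = sum(faces) / 6
variance = sum((k - mean) ** 2 for k in faces) / 6  # equals 35/12
std = math.sqrt(variance)

# Standard deviation of the mean of 10 independent throws
std_of_mean = std / math.sqrt(10)

print(round(mean, 4), round(std, 4), round(std_of_mean, 4))  # 3.5 1.7078 0.5401
```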
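import math
import numpy as np
import scipy.stats as stats

# avg is the list of sample means built by the simulation loop in the question
n_bins = 50
# density=True makes the bar areas sum to 1, giving an empirical pdf
pdf_from_hist, bin_edges = np.histogram(np.array(avg), bins=n_bins, density=True)
bin_mid_pts = np.add(bin_edges[:-1], bin_edges[1:]) * 0.5
assert len(pdf_from_hist) == len(bin_mid_pts)
expected_std = 1.7078 / math.sqrt(10)
expected_mean = 3.5
pk_s = []
qk_s = []
for i in range(n_bins):
    p = stats.norm.pdf(bin_mid_pts[i], expected_mean, expected_std)
    q = pdf_from_hist[i]
    if q <= 1.0e-5:
        continue  # skip (near-)empty bins, which would make the divergence infinite
    pk_s.append(p)
    qk_s.append(q)
# compute the KL divergence (stats.entropy normalises both inputs to sum to 1)
kl_div = stats.entropy(pk_s, qk_s)
print('the pdf of the mean of the 10 throws differs from the corresponding normal dist with a kl divergence of %r' % kl_div)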