
Matching moments of a fitted parametric distribution in Python is inaccurate

In the following code, I generate a random variable x which I know is Normally distributed. I then fit a parametric Normal distribution to it (through maximum likelihood estimation of the parameters) in order to simulate a synthetic variable y that should match the properties of the original data. The statistical moments (mean, standard deviation) of x and y should therefore match too.

Why then don't the moments of y's distribution match those of x's distribution? In one run of the code below, the mean of y (0.052) was 5 times as large as x's (0.01), and sometimes it was even negative when it should have been positive.

import numpy as np
from scipy.stats import norm

n = 2000
x = norm.rvs(size=n)
y = norm(*norm.fit(x)).rvs(size=n)

for i in [x,y]:
    print("mu={:.4f}, sd={:.4f}".format(np.mean(i), np.std(i)))

Why then don't the moments of y's distribution match those of x's distribution?

They do, or at least they do to within the expected error.(1)

A quick observation is that both are close to the standard normal distribution: their first moments are both close to 0 and their second moments close to 1. However, notice that x is sampled from N(0, 1) while y is sampled from N(mean(x), std(x)).

Large sample size n

If you want their values to be closer, simply increase the sample size n. We'll fix random_state for reproducibility.(2)

import numpy as np
from scipy.stats import norm

n = 200000

for trial in range(5):
    x = norm.rvs(size=n, random_state=trial)
    y = norm(*norm.fit(x)).rvs(size=n, random_state=trial)

    print("Trial {}".format(trial))
    for sample in [x, y]:
        print("mu={:.4f}, sd={:.4f}".format(np.mean(sample), np.std(sample)))

This yields:

Trial 0
mu=0.0033, sd=0.9980
mu=0.0067, sd=0.9960
Trial 1
mu=0.0045, sd=0.9977
mu=0.0089, sd=0.9953
Trial 2
mu=-0.0004, sd=0.9981
mu=-0.0008, sd=0.9963
Trial 3
mu=-0.0019, sd=0.9965
mu=-0.0037, sd=0.9930
Trial 4
mu=-0.0052, sd=0.9992
mu=-0.0104, sd=0.9984
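For scale: the residual deviations above are on the order of the standard error of the sample mean, sigma / sqrt(n). A quick back-of-the-envelope check, using the true sigma = 1 of the generating distribution:

```python
import numpy as np

n = 200000
se = 1 / np.sqrt(n)  # standard error of the sample mean, sigma / sqrt(n), with sigma = 1
print("SE of the mean at n={}: {:.4f}".format(n, se))  # 0.0022
```

The per-trial means above (|mu| up to about 0.01) sit within a few standard errors of 0, which is entirely unremarkable.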

Small sample size n

With a small sample size n, we'd naturally expect some discrepancy between x and y, because y is itself another random sample. However, we can observe that the fitted parameters behave as expected:

n = 200
for i in range(5):
    x = norm.rvs(size=n, random_state=i)    
    print("Trial {}".format(i))
    print(np.mean(x), np.std(x), norm(*norm.fit(x)).args)

This yields:

Trial 0
0.07091049314116117 1.0214227686959954 (0.07091049314116117, 1.0214227686959954)
Trial 1
0.1066888148479486 0.9100459829739235 (0.1066888148479486, 0.9100459829739235)
Trial 2
0.012250008696874187 1.0800421002497833 (0.012250008696874187, 1.0800421002497833)
Trial 3
-0.07079063505988327 0.9767123391405987 (-0.07079063505988327, 0.9767123391405987)
Trial 4
0.028540839305884236 0.9537561748836348 (0.028540839305884236, 0.9537561748836348)
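The fitted parameters are exact copies of the sample moments, yet a fresh sample y drawn from them adds its own sampling noise on top: Var(mean(y)) is roughly 2/n, double that of mean(x), which is why mean(y) can drift several times further from 0 than mean(x). A minimal Monte Carlo sketch of this (the trial count and seed here are arbitrary choices):

```python
import numpy as np
from scipy.stats import norm

n, trials = 200, 2000
rng = np.random.default_rng(1)
mean_x, mean_y = [], []
for _ in range(trials):
    x = rng.standard_normal(n)
    loc, scale = norm.fit(x)            # MLE: (mean(x), std(x))
    y = rng.normal(loc, scale, size=n)  # fresh sample from the fitted distribution
    mean_x.append(x.mean())
    mean_y.append(y.mean())
print(np.std(mean_x))  # close to 1/sqrt(n) ~ 0.071
print(np.std(mean_y))  # close to sqrt(2/n) ~ 0.100
```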

(1) I have not actually calculated the standard error, so correct me if I'm wrong. A quick search on Cross Validated gives a nice explanation of standard errors in general.

(2) Fixing the random state of x and of norm(*norm.fit(x)) does not imply that random samples from the latter should reproduce N(mean(x), std(x)) exactly. Then again, referring to (1) above: why should they?
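In fact, reusing the same random_state for x and for the frozen distribution's rvs makes y a deterministic affine transform of x rather than an independent sample: scipy draws standard normals from the seeded generator and then applies loc and scale. That is why, in the large-n output above, mean(y) is almost exactly double mean(x) and sd(y) is close to sd(x) squared. A check of this (it relies on scipy's loc-scale sampling internals, so treat it as an observation about the current implementation):

```python
import numpy as np
from scipy.stats import norm

n = 200000
x = norm.rvs(size=n, random_state=0)
loc, scale = norm.fit(x)
y = norm(loc, scale).rvs(size=n, random_state=0)

# Same seed -> same underlying standard-normal draws, so y is just an
# affine transform of x: y == loc + scale * x elementwise.
print(np.allclose(y, loc + scale * x))  # True
print(np.mean(y) / np.mean(x))          # ~ 1 + std(x), i.e. roughly 2
```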
