Scipy rv_continuous incorrectly generating sample from distribution

Question

from scipy import stats
import numpy as np 

class your_distribution(stats.rv_continuous):
    def _pdf(self, x):
        p0 = 10.9949
        p1 = 0.394447
        p2 = 12818.4
        p3 = 2.38898

        return ((p1*p3)/(p3*p0+p2*p1))*((p0*np.exp(-1.0*p1*x))+(p2*np.exp(-1.0*p3*x)))

distribution = your_distribution(a=0.15, b=10.1)
sample = distribution.rvs(size=50000)

The code above generates 50000 samples from a normalized pdf in the range 0.15 to 10.1. However, there is a disproportionately large number of samples generated at the upper bound b=10.1 . This does not make sense, as seen when the pdf is plotted.

How would I fix this issue?

Answer 1

The PDF is correctly normalized for the full range of the distribution. However, setting a and b simply cuts the PDF without any re-normalization. With (a=0.15, b=10.1) the PDF no longer integrates to 1, and by a quirk of the scipy implementation the remaining density is apparently added at the end of the range. This causes the large number of samples at the upper bound.

We can visualize what is going on by plotting the cumulative density function (CDF) for a=0 and a=0.15:

x = np.linspace(0, 15, 1000)

distribution = your_distribution(a=0.0, b=10.1)
plt.plot(x, distribution.cdf(x), label='a=0')

distribution = your_distribution(a=0.15, b=10.1)
plt.plot(x, distribution.cdf(x), label='a=0.15')

plt.legend()

To get rid of the jump in the CDF and the spurious samples at the upper range we need to re-normalize the PDF for the a..b range. I am too lazy to analytically work out the correct factor, so let's make scipy do the hard work:

from scipy import stats
from scipy.integrate import quad
import numpy as np

# I pulled the definition of the PDF out of the class so we can use it to
# compute the scale factor.
def pdf(x):
    p0 = 10.9949
    p1 = 0.394447
    p2 = 12818.4
    p3 = 2.38898

    return ((p1*p3)/(p3*p0+p2*p1))*((p0*np.exp(-1.0*p1*x))+(p2*np.exp(-1.0*p3*x)))    

class your_distribution(stats.rv_continuous):        
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # integrate area of the PDF in range a..b
        self.scale, _ = quad(pdf, self.a, self.b)

    def _pdf(self, x):
        return pdf(x) / self.scale  # scale PDF so that it integrates to 1 in range a..b 

distribution = your_distribution(a=0.15, b=10.1)
sample = distribution.rvs(size=1000)

If you happen to know the analytical solution of the integral you can use it instead of the call to quad .

Scipy rv_continuous incorrectly generating sample from distribution

Question

1 answers

solution1
2 ACCPTED 2018-06-14 14:03:34

Scipy rv_continuous incorrectly generating sample from distribution

Question

1 answers

solution1 2 ACCPTED 2018-06-14 14:03:34

solution1
2 ACCPTED 2018-06-14 14:03:34