
Python: Kernel Density Estimation for positive values

I want to get a kernel density estimate for positive data points. Using the Python SciPy stats package, I came up with the following code.

import numpy as np
import scipy.stats as st

def get_pdf(data):
    a = np.array(data)
    ag = st.gaussian_kde(a)                        # fit a Gaussian KDE to the data
    x = np.linspace(0, max(data), int(max(data)))  # evaluation grid; linspace needs an int count
    y = ag(x)
    return x, y
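For reference, here is a minimal way to call this helper and plot the result (a sketch; it assumes numpy, scipy.stats, and matplotlib are available under the usual aliases):

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

data = st.pareto.rvs(1, size=10000)   # positive-valued sample
x, y = get_pdf(data)

plt.plot(x, y)                        # plot the estimated density
plt.xlabel('x')
plt.ylabel('density')
plt.show()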

This works perfectly for most data sets, but it gives an erroneous result for "all positive" data points. To make sure this works correctly, I use numerical integration to compute the area under this curve.

def trapezoidal_2(ag, a, b, n):
    # Composite trapezoidal rule with n subintervals on [a, b].
    h = float(b - a) / n   # np.float was removed in NumPy 1.24; plain float works
    s = 0.0
    s += ag(a)[0]/2.0
    for i in range(1, n):
        s += ag(a + i*h)[0]
    s += ag(b)[0]/2.0
    return s * h

Since the data lies in the region (0, int(max(data))), we should get a value close to 1 when executing the following lines.

b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)

a = np.array(data)
ag = st.gaussian_kde(a)
trapezoidal_2(ag, 0, int(max(data)), int(max(data))*2)

But when I test it, it gives a value close to 0.5.

But when I integrate from -100 to max(data), it gives a value close to 1.

trapezoidal_2(ag, -100, int(max(data)), int(max(data))*2+200)

The reason is that ag (the KDE) assigns nonzero density to values less than 0, even though the original data set contains only positive values.
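One can confirm this directly by evaluating the fitted KDE at negative points and measuring how much probability mass falls below zero (a sketch; the exact numbers vary with the random sample):

import numpy as np
import scipy.stats as st

data = st.pareto.rvs(1, size=10000)
ag = st.gaussian_kde(data)

# Density below zero is not zero, even though every sample is positive:
print(ag(-1.0))

# Probability mass leaked below zero (roughly 0.5 for this heavy-tailed sample,
# matching the discrepancy between the two trapezoidal results above):
print(ag.integrate_box_1d(-np.inf, 0))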

So how can I get a kernel density estimate that considers only positive values, such that the area under the curve in the region (0, max(data)) is close to 1?

The choice of bandwidth is quite important when performing kernel density estimation. I think Scott's rule and Silverman's rule work well for distributions similar to a Gaussian. However, they do not work well for the Pareto distribution.
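To see why, one can inspect the kernel width that gaussian_kde actually uses for a heavy-tailed sample (a sketch of my own, not part of the original answer; values vary with the sample):

import numpy as np
from scipy import stats

sample = stats.pareto.rvs(1, size=3000)
kde = stats.gaussian_kde(sample, bw_method='scott')

# The rule-of-thumb bandwidth scales with the sample standard deviation,
# which the Pareto's heavy tail inflates, so the kernels end up far too
# wide for the sharp peak near x = 1.
print('scott factor:', kde.factor)
print('kernel std  :', np.sqrt(kde.covariance[0, 0]))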

Quote from the docs:

Bandwidth selection strongly influences the estimate obtained from the KDE (much more so than the actual shape of the kernel). Bandwidth selection can be done by a "rule of thumb", by cross-validation, by "plug-in methods" or by other means; see [3], [4] for reviews. gaussian_kde uses a rule of thumb, the default is Scott's Rule.

Try with different bandwidth values, for example:

import numpy as np
import matplotlib.pyplot as plt

from scipy import stats

b = 1

sample = stats.pareto.rvs(b, size=3000)
kde_sample_scott = stats.gaussian_kde(sample, bw_method='scott')
kde_sample_scalar = stats.gaussian_kde(sample, bw_method=1e-3)


# Compute the integral of the estimated density over (0, inf):
print('integral scott:', kde_sample_scott.integrate_box_1d(0, np.inf))
print('integral scalar:', kde_sample_scalar.integrate_box_1d(0, np.inf))

# Graph:
x_span = np.logspace(-2, 1, 550)
plt.plot(x_span, stats.pareto.pdf(x_span, b), label='theoretical pdf')
plt.plot(x_span, kde_sample_scott(x_span), label="estimated pdf 'scott'")
plt.plot(x_span, kde_sample_scalar(x_span), label="estimated pdf 'scalar'")
plt.xlabel('X'); plt.legend();

gives:

integral scott: 0.5572130540733236
integral scalar: 0.9999999999968957

and:

[figure: theoretical Pareto pdf vs. the 'scott' and scalar-bandwidth KDE estimates]

We see that the KDE using Scott's rule is badly off.
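The quoted documentation also mentions cross-validation as a bandwidth-selection method. A possible sketch using scikit-learn (my own addition, not part of the original answer; the grid bounds are arbitrary assumptions):

import numpy as np
from scipy import stats
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

sample = stats.pareto.rvs(1, size=3000)

# Search a log-spaced grid of bandwidths by cross-validated log-likelihood.
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.logspace(-3, 0, 20)},
                    cv=5)
grid.fit(sample[:, None])   # KernelDensity expects a 2D array
print('best bandwidth:', grid.best_params_['bandwidth'])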
