
Python: Kernel Density Estimation for positive values

I want to get a kernel density estimate for positive data points. Using the Python SciPy stats package, I came up with the following code.

import numpy as np
import scipy.stats as st

def get_pdf(data):
    a = np.array(data)
    ag = st.gaussian_kde(a)                        # fit a Gaussian KDE to the data
    x = np.linspace(0, max(data), int(max(data)))  # evaluation grid; linspace needs an int count
    y = ag(x)
    return x, y
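For reference, here is a minimal way to call this helper and plot the result (a sketch; it assumes numpy, scipy.stats, and matplotlib are available under the usual aliases):

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

data = st.pareto.rvs(1, size=10000)   # positive-valued sample
x, y = get_pdf(data)

plt.plot(x, y)                        # plot the estimated density
plt.xlabel('x')
plt.ylabel('density')
plt.show()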

This works perfectly for most data sets, but it gives an erroneous result for "all positive" data points. To make sure this works correctly, I use numerical integration to compute the area under this curve.

def trapezoidal_2(ag, a, b, n):
    # Composite trapezoidal rule with n subintervals on [a, b].
    h = float(b - a) / n   # np.float was removed in NumPy 1.24; plain float works
    s = 0.0
    s += ag(a)[0]/2.0
    for i in range(1, n):
        s += ag(a + i*h)[0]
    s += ag(b)[0]/2.0
    return s * h

Since the data lies in the region (0, int(max(data))), we should get a value close to 1 when executing the following lines.

b = 1
data = st.pareto.rvs(b, size=10000)
data = list(data)

a = np.array(data)
ag = st.gaussian_kde(a)
trapezoidal_2(ag, 0, int(max(data)), int(max(data))*2)

But when I test it, it gives a value close to 0.5.

But when I integrate from -100 to max(data), it gives a value close to 1.

trapezoidal_2(ag, -100, int(max(data)), int(max(data))*2+200)

The reason is that ag (the KDE) assigns nonzero density to values less than 0, even though the original data set contains only positive values.
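One can confirm this directly by evaluating the fitted KDE at negative points and measuring how much probability mass falls below zero (a sketch; the exact numbers vary with the random sample):

import numpy as np
import scipy.stats as st

data = st.pareto.rvs(1, size=10000)
ag = st.gaussian_kde(data)

# Density below zero is not zero, even though every sample is positive:
print(ag(-1.0))

# Probability mass leaked below zero (roughly 0.5 for this heavy-tailed sample,
# matching the discrepancy between the two trapezoidal results above):
print(ag.integrate_box_1d(-np.inf, 0))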

So how can I get a kernel density estimate that considers only positive values, such that the area under the curve in the region (0, max(data)) is close to 1?

The choice of bandwidth is quite important when performing kernel density estimation. I think Scott's rule and Silverman's rule work well for distributions similar to a Gaussian. However, they do not work well for the Pareto distribution.
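To see why, one can inspect the kernel width that gaussian_kde actually uses for a heavy-tailed sample (a sketch of my own, not part of the original answer; values vary with the sample):

import numpy as np
from scipy import stats

sample = stats.pareto.rvs(1, size=3000)
kde = stats.gaussian_kde(sample, bw_method='scott')

# The rule-of-thumb bandwidth scales with the sample standard deviation,
# which the Pareto's heavy tail inflates, so the kernels end up far too
# wide for the sharp peak near x = 1.
print('scott factor:', kde.factor)
print('kernel std  :', np.sqrt(kde.covariance[0, 0]))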

Quote from the docs:

Bandwidth selection strongly influences the estimate obtained from the KDE (much more so than the actual shape of the kernel). Bandwidth selection can be done by a "rule of thumb", by cross-validation, by "plug-in methods" or by other means; see [3], [4] for reviews. gaussian_kde uses a rule of thumb, the default is Scott's Rule.

Try with different bandwidth values, for example:

import numpy as np
import matplotlib.pyplot as plt

from scipy import stats

b = 1

sample = stats.pareto.rvs(b, size=3000)
kde_sample_scott = stats.gaussian_kde(sample, bw_method='scott')
kde_sample_scalar = stats.gaussian_kde(sample, bw_method=1e-3)


# Compute the integral of the estimated density over (0, inf):
print('integral scott:', kde_sample_scott.integrate_box_1d(0, np.inf))
print('integral scalar:', kde_sample_scalar.integrate_box_1d(0, np.inf))

# Graph:
x_span = np.logspace(-2, 1, 550)
plt.plot(x_span, stats.pareto.pdf(x_span, b), label='theoretical pdf')
plt.plot(x_span, kde_sample_scott(x_span), label="estimated pdf 'scott'")
plt.plot(x_span, kde_sample_scalar(x_span), label="estimated pdf 'scalar'")
plt.xlabel('X'); plt.legend();

gives:

integral scott: 0.5572130540733236
integral scalar: 0.9999999999968957

and:

[figure: theoretical Pareto pdf vs. the 'scott' and scalar-bandwidth KDE estimates]

We see that the KDE using Scott's rule is badly off.
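The quoted documentation also mentions cross-validation as a bandwidth-selection method. A possible sketch using scikit-learn (my own addition, not part of the original answer; the grid bounds are arbitrary assumptions):

import numpy as np
from scipy import stats
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

sample = stats.pareto.rvs(1, size=3000)

# Search a log-spaced grid of bandwidths by cross-validated log-likelihood.
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.logspace(-3, 0, 20)},
                    cv=5)
grid.fit(sample[:, None])   # KernelDensity expects a 2D array
print('best bandwidth:', grid.best_params_['bandwidth'])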
