
Integration of a KDE: strange behavior of scipy.integrate.quad and the chosen bandwidth

I was looking for a way to obtain the mean (expected value) of a distribution from which I drew samples and to which I fitted a kernel density estimate with scipy.stats.gaussian_kde. I remember from my statistics class that the expected value is just the integral of x * pdf(x) from -infinity to infinity:

$$E[X] = \int_{-\infty}^{\infty} x \cdot \mathrm{pdf}(x)\,dx$$

I used scipy.integrate.quad for this in my code, but I ran into apparently strange behavior (which might have something to do with the bandwidth parameter of the KDE).

Problem

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

np.random.seed(42)

# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),
                             np.random.normal(loc=4, scale=2.0, size=500)])


kde = gaussian_kde(test_array,bw_method=0.5)


X_range = np.arange(-16,20,0.1)

# Scalar wrapper around the KDE so that quad receives a plain float
pdf = lambda x: kde.evaluate([x])[0]

# Evaluate the density on the whole grid in one vectorized call
y = kde.evaluate(X_range)

_ = plt.plot(X_range,y)


# Integrate over pdf * x to obtain the mean
mean_integration_low_bw = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]

# Calculate the cdf at point of the mean
zero_int_low = quad(lambda x: pdf(x), a=-np.inf, b=mean_integration_low_bw)[0]

print("The mean after integration: {}\n".format(round(mean_integration_low_bw,4)))

print("F({}): {}".format(round(mean_integration_low_bw,4),round(zero_int_low,4)))

plt.axvline(x=mean_integration_low_bw,color ="r")
plt.show()

If I execute this code, I get strange results for the integrated mean and for the cumulative distribution function evaluated at that mean:

[Figure: KDE plot for bw_method=0.5, with the integrated mean marked by a red vertical line]

First question: In my opinion it should always show F(mean) = 0.5, or am I wrong here? (Does this only apply to symmetric distributions?)

Second question: Even stranger, the value of the integrated mean does not change with the bandwidth parameter. In my opinion the mean should change too if the shape of the underlying distribution differs. If I set the bandwidth to 5, I get the following graph:

[Figure: KDE plot for bw_method=5]

Why is the mean still the same if the curve now has a different shape (due to the wider bandwidth)?

I hope these questions do not arise only from my flawed understanding of statistics ;)

Your initial data is generated here:

# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
                             np.random.normal(loc=4,scale=2.0,size=500)])

So you have 500 samples from a distribution with mean 4 and 100 samples from a distribution with mean -10, and you can predict the expected average: (500*4 - 10*100)/(500+100) = 1.66666... That is pretty close to the result given by your code, and also very consistent with the first plot.
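To make this concrete, here is a minimal sketch (not the original poster's code) that checks both questions numerically. A Gaussian KDE is an equal-weight mixture of normal kernels, each centered on one data point, so its mean equals the sample mean whatever the bandwidth; F(mean), on the other hand, depends on the shape of the density and equals 0.5 only when the mean coincides with the median, as it does for a symmetric distribution. The helper `kde_mean`, the `results` dictionary, the use of `np.random.default_rng`, and the integration range of ten kernel standard deviations beyond the data are assumptions made for this illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gaussian_kde

# Same mixture as in the question; the Generator API and seed are illustrative choices
rng = np.random.default_rng(42)
data = np.concatenate([
    rng.normal(loc=-10, scale=0.8, size=100),
    rng.normal(loc=4, scale=2.0, size=500),
])

def kde_mean(kde):
    """Integrate x * pdf(x) over a finite range that covers almost all KDE mass."""
    h = float(np.sqrt(kde.covariance[0, 0]))  # standard deviation of one kernel
    lo, hi = data.min() - 10 * h, data.max() + 10 * h
    value, _ = quad(lambda x: x * kde.evaluate([x])[0], lo, hi, limit=200)
    return value

results = {}
for bw in (0.5, 5):
    kde = gaussian_kde(data, bw_method=bw)
    mean = kde_mean(kde)
    # CDF at the mean, computed in closed form by the KDE itself
    cdf_at_mean = kde.integrate_box_1d(-np.inf, mean)
    results[bw] = (mean, cdf_at_mean)
    print(f"bw={bw}: mean={mean:.4f}, F(mean)={cdf_at_mean:.4f}")

print(f"sample mean: {data.mean():.4f}")
```

Both bandwidths should print essentially the same mean as `data.mean()`, while F(mean) is free to differ between them, because widening the kernels changes the shape of the density but not where its mass is centered.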
