How to calculate the probability between two numbers from a probability distribution in Python
I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
More edits:
Using CDFs per @7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's EPA per pass from last season:
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
    + str(YEAR) + '.csv.gz?raw=True', compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to
or
Computing the probability p for some interval is not overly complicated. However, it might be tricky to combine the right tools to do so, in particular since there are several statistical approaches to choose from.
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the graph of the probability density function (PDF) f on the interval [lower, upper].
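As a quick check of the CDF identity, here is a minimal sketch for the standard normal distribution, where F is available in closed form through scipy.stats.norm.cdf. The choice of a standard normal and of the bounds -1 and 1 is an assumption made for illustration; it reproduces the 68% figure from the 68-95-99.7 rule.

```python
from scipy import stats

# Illustrative bounds: one standard deviation on either side of the mean
lower, upper = -1.0, 1.0

# p = F(upper) - F(lower), with F the standard normal CDF
p = stats.norm.cdf(upper) - stats.norm.cdf(lower)
print(round(p, 4))  # 0.6827
```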
However, when the CDF/PDF is unknown, it constitutes a statistical question. In a nutshell, estimating the PDF f and computing the area under its graph over the interval will do. But there are several paradigms and estimation procedures to obtain it.
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (the mean, aka loc) and sigma (the standard deviation, aka scale). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
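which yields a plot of the fitted normal density, with the area over [lower, upper] shaded and annotated with the value of p.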
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it by estimated parameters as seen above). Kernel density estimation is the most popular variant to do so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
import numpy as np
from matplotlib import pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
lower, upper = 0.0, 2.0  # same interval as in the parametric example
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
    '''wrapper function returning the estimated density at a single point x'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
# quad returns a tuple: (integral estimate, absolute error estimate)
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s='p=' + str(round(p[0], 3)))
plt.show()
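and yields the corresponding plot for the kernel density estimate, again with the area over [lower, upper] shaded and annotated with p.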
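As an alternative sketch (not the approach used above), SciPy ships its own Gaussian kernel density estimator, scipy.stats.gaussian_kde, whose integrate_box_1d method integrates a one-dimensional estimate between two bounds without a separate quadrature call. Note that gaussian_kde chooses its bandwidth by Scott's rule by default, so the number it produces will generally differ from the scikit-learn estimate with bandwidth=.9.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

lower, upper = 0.0, 2.0
x = np.array([-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231,
              3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814])

# Fit a Gaussian KDE (default bandwidth: Scott's rule)
kde = stats.gaussian_kde(x)

# Probability mass of the estimated density on [lower, upper]
p_box = kde.integrate_box_1d(lower, upper)

# Cross-check against generic numerical quadrature of the same density
p_quad, abs_err = quad(lambda t: kde(t)[0], lower, upper)

print(round(p_box, 3), round(p_quad, 3))
```

Both numbers should agree up to the quadrature tolerance, since they integrate the same estimated density.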
I do see a bug in the get_probability function, but that bug causes it to compute a result that is too high. In np.sum(kd_vals * step), it's multiplying N sample values by a step with N-1 in the denominator, effectively resulting in an output a factor of N/(N-1) too high. (If they wanted to use a trapezoid rule computation for the integral, they should have divided the left and right endpoint values by 2 first.)
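To make the endpoint issue concrete, the sketch below compares the question's rectangle-style sum with a trapezoid rule on the same grid. To stay self-contained it integrates the exact standard normal PDF from scipy.stats.norm rather than a fitted KernelDensity; that substitution and the helper names riemann_sum and trapezoid_sum are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def riemann_sum(pdf, start, end, n):
    """Sum used in the question: n samples times a step of (end - start) / (n - 1)."""
    xs = np.linspace(start, end, n)
    step = (end - start) / (n - 1)
    return np.sum(pdf(xs) * step)

def trapezoid_sum(pdf, start, end, n):
    """Trapezoid rule on the same grid: the two endpoint values get half weight."""
    xs = np.linspace(start, end, n)
    step = (end - start) / (n - 1)
    vals = pdf(xs)
    return step * (np.sum(vals) - 0.5 * (vals[0] + vals[-1]))

exact = stats.norm.cdf(1.0) - stats.norm.cdf(-1.0)
naive = riemann_sum(stats.norm.pdf, -1.0, 1.0, 100)
trap = trapezoid_sum(stats.norm.pdf, -1.0, 1.0, 100)
print(f"exact={exact:.4f} naive={naive:.4f} trapezoid={trap:.4f}")
```

With 100 evaluation points the overshoot of the rectangle-style sum is well under one percentage point, so it pushes the result up slightly and cannot explain a value that is too low.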
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
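The size of this flattening can be estimated without fitting anything. A Gaussian kernel density estimate is, on average, the true density convolved with the kernel, so for N(0, 1) data and bandwidth h the estimate approaches a normal density with variance 1 + h^2 as the sample grows. The sketch below relies on that standard result (an assumption layered on top of the explanation above, and the helper name smoothed_probability is hypothetical) to compute the probability within one unit of the mean for a few bandwidths.

```python
import numpy as np
from scipy import stats

def smoothed_probability(h, lower=-1.0, upper=1.0):
    """P(lower < X < upper) under N(0, 1 + h**2), the large-sample limit of a
    Gaussian KDE with bandwidth h fitted to standard normal data."""
    scale = np.sqrt(1.0 + h ** 2)
    return stats.norm.cdf(upper, scale=scale) - stats.norm.cdf(lower, scale=scale)

# Larger bandwidths spread more mass into the tails
for h in (0.5, 0.2, 0.05):
    print(h, round(smoothed_probability(h), 4))
```

For h = 0.5 this gives roughly 0.63, close to the 0.6338 reported in the question, and the value moves toward 0.68 as the bandwidth shrinks.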