
How to generate a Q-Q plot manually without inverse distribution function in python

I have 4 different distributions which I've fitted to a sample of observations. Now I want to compare my results and find the best solution. I know there are a lot of different methods to do that, but I'd like to use a quantile-quantile (Q-Q) plot.

The formulas for my 4 distributions are:

[formula images: dist 1, dist 2, dist 3, dist 4]

where K_0 is the modified Bessel function of the second kind and zeroth order, and Γ is the gamma function.

My sample looks roughly like this: (0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7, ...), so it contains repeated values as well as gaps between them.

I've read the instructions on this site and tried to implement them in Python. So, following the link:

1) I sorted my data from the smallest to the largest value.

2) I computed "n" evenly spaced points on the interval (0,1), where "n" is my sample size.

3) And this is the step I can't manage.

As far as I understand, I should now take the values I calculated beforehand (those evenly spaced points), plug them into the inverse functions of my distributions above, and thus compute the theoretical quantiles of my distributions.
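
For illustration, here is a minimal sketch of steps 1–3 for a distribution whose inverse CDF is readily available; scipy.stats.expon is only a stand-in for one of my fitted distributions:

import numpy as np
from scipy import stats

sample = np.array([0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7])

# step 1: sort the observations
ordered = np.sort(sample)

# step 2: n evenly spaced points on (0, 1)
n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n

# step 3: theoretical quantiles via the inverse CDF (stand-in distribution)
theoretical = stats.expon.ppf(probs, scale=1.0)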

For reference, here are the inverse functions (partly calculated with WolframAlpha, as far as that was possible):

[formula images: invdist 1, invdist 2, invdist 3, invdist 4]

where W is the Lambert W-function and everything in brackets afterwards is the argument.

The problem is, apparently there doesn't exist an inverse function for the first distribution. The next one would probably produce complex values (negative under the root, because b = 0.55 according to the fit), and the last two of them involve a Lambert W-function (where I'm unsure how to implement it in Python).
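
As an aside on the Lambert W part: SciPy ships scipy.special.lambertw, so that piece at least would not have to be hand-rolled. A minimal usage sketch, assuming the principal branch (k=0) is the relevant one:

import numpy as np
from scipy.special import lambertw

# lambertw returns a complex value; on the principal branch (k=0) the
# result is real for arguments >= -1/e, so take the real part
w = np.real(lambertw(0.5, k=0))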

So my question is: is there a way to compute the Q-Q plots without analytical expressions for the inverse distribution functions?

I'd very much appreciate any help you could give me!

A simpler and more conventional way to go about this is to compute the log likelihood for each model and choose the one that has the greatest log likelihood. You don't need the cdf or quantile function for that, only the density function, which you have already.

The log likelihood is just the sum of log p(x|model), where p(x|model) is the probability density of datum x under a given model. Here "model" means the model with its parameters chosen by maximizing the log likelihood over the possible parameter values.
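
A minimal sketch of that comparison, assuming pdf_1 … pdf_4 stand for the four fitted densities with their fitted parameters already plugged in (the exponential below is only a placeholder):

import numpy as np

def pdf_1(x):
    # placeholder density; substitute the actual fitted pdf of dist 1
    return 1.5 * np.exp(-1.5 * x)

densities = {"dist 1": pdf_1}  # add pdf_2 ... pdf_4 analogously

sample = np.array([0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7])

log_likelihoods = {name: np.sum(np.log(pdf(sample))) for name, pdf in densities.items()}
best_model = max(log_likelihoods, key=log_likelihoods.get)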

You can be more careful about this by integrating the log likelihood over the parameter space, taking into account also any prior probability assigned to each model; that would be a Bayesian approach.

It sounds like you are essentially looking to choose a model by minimizing the Kolmogorov-Smirnov (KS) statistic, which despite its heavy name is pretty simple -- it is the maximum difference between the model CDF and the empirical CDF. That's defensible, but I think comparing log likelihoods is more conventional, and also simpler since you need only the pdf.
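
If you do want to go the KS route instead, scipy.stats.kstest already computes the statistic; a rough sketch, where cdf_1 is a placeholder for the CDF of one fitted model:

import numpy as np
from scipy import stats

def cdf_1(x):
    # placeholder CDF; substitute the CDF of a fitted model
    return 1.0 - np.exp(-1.5 * x)

sample = np.array([0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7])
statistic, p_value = stats.kstest(sample, cdf_1)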

It happens that there is an easier way. It took me a day or two of digging around before I was pointed toward the right method in scipy.stats. I was looking for the wrong sort of name!

First, build a subclass of rv_continuous to represent one of your distributions. We know the pdf for your distributions, so that's what we define. In this case there's just one parameter. If more are needed, just add them to the def statement and use them in the return statement as required.

>>> import numpy as np
>>> from scipy import stats
>>> param = 3/2
>>> class NoName(stats.rv_continuous):
...     def _pdf(self, x, param):
...         # exponential density with rate `param`; np.exp also handles array input
...         return param*np.exp(-param*x)
...     

Now create an instance of this object, declare the lower end of its support (i.e., the lowest value the random variable can take), and what the parameters are called.

>>> noname = NoName(a=0, shapes='param')

I don't have an actual sample of values to play with. I'll create a pseudo-random sample.

>>> sample = noname.rvs(size=100, param=param)

Sort it to make it into the so-called 'empirical cdf'.

>>> empirical_cdf = sorted(sample)

The sample has 100 elements, therefore generate 100 points at which to sample the inverse cdf, or quantile function, as discussed in the paper you referenced.

>>> theoretical_points = [(_-0.5)/len(sample) for _ in range(1, 1+len(sample))]

Get the quantile function values at these points.

>>> theoretical_cdf = [noname.ppf(_, param=param) for _ in theoretical_points]

Plot it all.

>>> from matplotlib import pyplot as plt
>>> plt.plot([0,3.5], [0, 3.5], 'b-')
[<matplotlib.lines.Line2D object at 0x000000000921B400>]
>>> plt.scatter(empirical_cdf, theoretical_cdf)
<matplotlib.collections.PathCollection object at 0x000000000921BD30>
>>> plt.show()

Here's the Q-Q plot that results.

[image: Q-Q plot]

Darn it... Sorry, I was fixated on a slick solution to somehow bypass the missing inverse CDF and calculate the quantiles directly (and avoid any numerical approaches). But it can also be done by simple brute force.

First you have to define a grid of candidate quantiles for your distributions yourself (for instance ten times finer than the original/empirical quantiles). Then you calculate the corresponding CDF values. Then you compare these values one by one with the ones calculated in step 2 of the question. The candidate quantiles whose CDF values show the smallest deviations are the ones you were looking for.

The precision of this solution is limited by the resolution of the quantile grid you define.
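
A minimal sketch of that brute-force inversion, where pdf is a placeholder for one of the fitted densities and its CDF is obtained by numerical integration (no analytic inverse needed):

import numpy as np
from scipy import integrate

# placeholder pdf; substitute one of the four fitted densities
def pdf(x):
    return 1.5 * np.exp(-1.5 * x)

# CDF by numerical integration of the pdf
def cdf(x):
    value, _ = integrate.quad(pdf, 0.0, x)
    return value

sample = np.array([0.2, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.4, 0.4, 0.6, 0.7])
n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n      # the step-2 points from the question

# fine grid of candidate quantiles, e.g. ten times more points than the sample
grid = np.linspace(0.0, sample.max() * 2, 10 * n)
grid_cdf = np.array([cdf(q) for q in grid])

# for each target probability, take the candidate whose CDF value is closest
theoretical_quantiles = grid[np.abs(grid_cdf[:, None] - probs[None, :]).argmin(axis=0)]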

But maybe I'm wrong and there is a more elegant way to solve this problem; if so, I would be happy to hear it!
