
Python Q-Q and P-P plot of two distributions of unequal length

I am not sure what the most statistically sound way to do this is, but I am trying to take a distribution of p-values and compare it to a much larger distribution of p-values created by permuting my original data. I am working with small p-values, so I am actually comparing the log10 of the p-values.

I have been trying to figure out a good general way to compare two arrays with similar values but unequal lengths. What I really want is something like scipy.qqplot(dataset1, dataset2), but that doesn't exist; the QQ plot only compares your distribution to an established distribution (this question has also been asked for R: https://stats.stackexchange.com/questions/12392/how-to-compare-two-datasets-with-qq-plot-using-ggplot2).

Essentially this amounts to comparing two histograms. I can use np.linspace to force the exact same bins for each distribution:

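import numpy as np

bins = 100
mx = max(np.max(vector1), np.max(vector2))
mn = min(np.min(vector1), np.min(vector2))  # smallest value across both vectors
# bins + 1 boundaries give exactly `bins` intervals
boundaries = np.linspace(mn, mx, bins + 1, endpoint=True)
# Label each bin by its midpoint
labels = [(boundaries[i]+boundaries[i+1])/2 for i in range(len(boundaries)-1)]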

I can then easily use these boundaries and labels to make two histograms, weighted by the length of the original vectors. The easiest thing to do is just use a few bins and plot them as histograms on the same axis, like in this question:

However, I really want something more like a QQ plot, and I want to use a lot of bins, so that I can see even small deviations from the 1-to-1 line. The problem with just plotting the two histograms is that they look like this:

[Image: histogram_example]

The two plots sit right on top of each other, so I can't see anything.

So what I want to figure out is how to compare these two histograms while maintaining the bin labels. I can easily plot the two against each other as a scatter graph, but that ends up being indexed by the bin frequency:

[Image: definitely wrong]

What I really want is to just compare the two histograms, or to make a QQ plot of the differences, but I cannot come up with a statistically sound way of doing this. I can find no methods that allow me to make a QQ plot with two datasets instead of one dataset and a built-in distribution, and I can't find any way of plotting two distributions of unequal length against each other.

For reference, here are the two histograms that went into creating that plot; you can see that they are extremely similar:

[Image: histograms]

I know there must be a good way of doing this, because it seems so obvious, but I am new to this kind of thing, and relatively new to scipy, pandas, and statsmodels as well.

I intentionally have not provided an example distribution here, because I wasn't sure how to make a minimal set of arrays that were non-normally distributed and captured what I am trying to do; plus, the point is to be able to do this for any two overlapping unequal-length arrays.

What I want to know is: what is the best way to approach this problem in Python in a statistically sound way? Is there some way of creating a distribution from the permuted data that could be used for a statsmodels or scipy QQ plot? Is there already a way to compare two histograms visually like this? Is there a way of making probability plots that I don't know about?


Edit: Trying cumulative and manual QQ plots

Thanks to @user333700's answer, I figured out how to create a manual QQ plot for the data, and also a cumulative probability plot. I created the plots using data with an overlapping min/max but the following distributions:

[Image: manufactured distributions]

QQ plot:

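import numpy as np
import matplotlib.pyplot as plt
# Evaluate both samples at the same 101 percentiles (0, 1, ..., 100)
q = np.linspace(0, 100, 101)
fig, ax = plt.subplots()
ax.scatter(np.percentile(ytest, q), np.percentile(xtest, q))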

[Image: qqplot]

So that works really well with simple data. The cumulative plot is similar:

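import pandas as pd
import matplotlib.pyplot as plt

# Pick bins: every k-th sorted value of x becomes a boundary (quantile-style bins);
# `bins` is the bin count defined earlier
x = ytest
y = xtest
boundaries = sorted(x)[::round(len(x)/bins)+1]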
labels = [(boundaries[i]+boundaries[i+1])/2 for i in range(len(boundaries)-1)]

# Bin two series into equal bins
xb = pd.cut(x, bins=boundaries, labels=labels)
yb = pd.cut(y, bins=boundaries, labels=labels)

# Get value counts for each bin and sort by bin
xhist = xb.value_counts().sort_index(ascending=True)/len(xb)
yhist = yb.value_counts().sort_index(ascending=True)/len(yb)

# Make cumulative (Series.iteritems() was removed in pandas 2.0; cumsum() gives the same running total)
xhist = xhist.cumsum()
yhist = yhist.cumsum()

# Plot it
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(xhist, yhist)
plt.show()

[Image: cumulative plot]

Going back to my actual skewed data (where the two distributions are extremely similar in every way except the lengths) and adding a 1-to-1 line, I get this for those two:

[Image: plots with real data]

So both work, which is great. The cumulative probability plot shows quite clearly that there is no large difference in the data, but the QQ plot shows that there is a small difference in the tail.

In terms of statistical tests, scipy has a two-sample Kolmogorov-Smirnov test for continuous variables. The binned histogram data can be used with a chi-square test. scipy.stats also has a k-sample Anderson-Darling test.
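A minimal sketch of these tests on synthetic stand-in data. The arrays `observed` and `permuted` and their gamma parameters are assumptions made for illustration, not the asker's data. It uses `scipy.stats.ks_2samp` and `scipy.stats.anderson_ksamp`, neither of which requires equal sample sizes, and `scipy.stats.chi2_contingency` on counts from shared bins as one possible way to apply a chi-square test to the binned data:

```python
import warnings

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-ins for two samples of unequal length with a skewed shape
observed = rng.gamma(shape=2.0, scale=1.0, size=500)
permuted = rng.gamma(shape=2.0, scale=1.0, size=20000)

# Two-sample Kolmogorov-Smirnov test on the raw (continuous) values
ks = stats.ks_2samp(observed, permuted)

# k-sample Anderson-Darling test with k = 2; SciPy may warn when the
# p-value falls outside its tabulated range, so warnings are silenced here
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    ad = stats.anderson_ksamp([observed, permuted])

# Chi-square test on counts in shared bins: 10 bins whose edges are the
# deciles of the pooled data, so no bin is empty
pooled = np.concatenate([observed, permuted])
edges = np.quantile(pooled, np.linspace(0, 1, 11))
counts_obs, _ = np.histogram(observed, bins=edges)
counts_perm, _ = np.histogram(permuted, bins=edges)
table = np.vstack([counts_obs, counts_perm])
chi2, chi2_p, dof, _ = stats.chi2_contingency(table)

print(f"KS: statistic={ks.statistic:.4f}, p={ks.pvalue:.4f}")
print(f"AD: statistic={ad.statistic:.4f}")
print(f"Chi-square: statistic={chi2:.4f}, p={chi2_p:.4f}, dof={dof}")
```

With a permuted sample this large, even tiny differences can become statistically significant, so the test results are probably best read alongside the plots rather than instead of them.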

For plotting:

The equivalent of a probability plot for two histograms would be to plot the cumulative frequencies for the two samples, i.e. with cumulative probabilities on each axis corresponding to the bin boundaries.
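As a rough sketch of that idea, the helper below (the name `pp_points`, the equal-width bins, and the synthetic gamma data are all assumptions made for illustration) evaluates the empirical cumulative frequency of each sample at the same bin boundaries, so the two results have equal length regardless of the sample sizes:

```python
import numpy as np

def pp_points(sample_a, sample_b, n_bins=100):
    """Cumulative frequencies of two samples at shared bin boundaries."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    lo = min(a[0], b[0])
    hi = max(a[-1], b[-1])
    # n_bins + 1 boundaries spanning both samples give n_bins shared bins
    edges = np.linspace(lo, hi, n_bins + 1)
    upper = edges[1:]
    # Fraction of each sample that is <= each upper bin boundary
    cum_a = np.searchsorted(a, upper, side="right") / a.size
    cum_b = np.searchsorted(b, upper, side="right") / b.size
    return upper, cum_a, cum_b

rng = np.random.default_rng(1)
small = rng.gamma(2.0, 1.0, size=300)     # hypothetical short sample
large = rng.gamma(2.0, 1.0, size=50000)   # hypothetical long sample
upper_edges, cum_small, cum_large = pp_points(small, large)
```

Plotting `cum_small` against `cum_large` (for example with `ax.scatter`) together with the line from (0, 0) to (1, 1) gives the P-P plot, and `upper_edges` keeps the bin boundary that each point corresponds to.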

statsmodels has a qq-plot for two-sample comparison; however, it currently assumes that the sample sizes are the same. If the sample sizes are different, then the quantiles need to be computed for the same probabilities. https://github.com/statsmodels/statsmodels/issues/2896 https://github.com/statsmodels/statsmodels/pull/3169 (I don't remember what the status of this is.)
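One way to compute quantiles for the same probabilities is sketched below. This is not the statsmodels implementation; the helper name `qq_points`, the plotting positions (i + 0.5) / n based on the smaller sample, and the synthetic data are assumptions made for illustration:

```python
import numpy as np

def qq_points(sample_a, sample_b):
    """Quantile pairs for a two-sample Q-Q plot with unequal sample sizes."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Use one probability per observation of the smaller sample, so the
    # larger sample is interpolated down rather than the smaller one up
    n = min(a.size, b.size)
    probs = (np.arange(n) + 0.5) / n
    # Both samples are evaluated at the same probabilities
    return np.quantile(a, probs), np.quantile(b, probs)

rng = np.random.default_rng(2)
small = rng.gamma(2.0, 1.0, size=250)     # hypothetical short sample
large = rng.gamma(2.0, 1.0, size=40000)   # hypothetical long sample
q_small, q_large = qq_points(small, large)
```

Scattering `q_small` against `q_large` and adding a 1-to-1 line then shows tail deviations directly in the units of the data.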
