Why do different percentiles give the same value?
I was trying to calculate 10 percentiles for a list of chi-squared distributed values. I used "chi-squared" because I think this is closest to what our real data looks like.
I was trying to do this step by step so that I don't miss anything.
import numpy as np

# 1000 chi-squared samples (6 degrees of freedom), truncated to integers and scaled by 10
values = np.array([int(w)*10 for w in np.random.chisquare(6, 1000)])
print('Min: ', np.min(values))
print('Max: ', np.max(values))
print('Mean: ', np.mean(values))
# Percentiles at 10%, 20%, ..., 100%
for p in range(10, 101, 10):
    percentile = np.percentile(values, p)
    print(p, percentile)
This is an example output of the code above:
Min: 0
Max: 230
Mean: 55.49
Percent: 10 Percentile: 20.0
Percent: 20 Percentile: 30.0
Percent: 30 Percentile: 30.0
Percent: 40 Percentile: 40.0
Percent: 50 Percentile: 50.0
Percent: 60 Percentile: 60.0
Percent: 70 Percentile: 70.0
Percent: 80 Percentile: 80.0
Percent: 90 Percentile: 100.0
Percent: 100 Percentile: 230.0
The point I'm struggling with is: why do I get the same "Percentile" for 20 and 30 percent?
I always thought that 20 / 30 means: 20 percent of the values lie below the following value (in this case 30), just as 100% of the values lie below 230, which is the maximum.
What idea am I missing?
Because the array values was created with the expression int(w)*10, all the values are integer multiples of 10. This means most of the values are repeated many times. For example, I just ran that code and found that the value 30 was repeated 119 times. It turns out that, when you count the values, the interquantile interval 20% - 30% contains only the value 30. That's why the value 30 is repeated in your output.
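To see the effect of the rounding in isolation, here is a minimal sketch that compares the percentiles of the raw chi-squared samples with those of the rounded ones. The fixed seed and the names raw, rounded, raw_pct and rounded_pct are assumptions made for illustration; they are not part of the original code.

```python
import numpy as np

# Fixed seed so the sketch is reproducible (an assumption, not in the question).
rng = np.random.default_rng(0)

raw = rng.chisquare(6, 1000)      # continuous samples, ties are practically impossible
rounded = raw.astype(int) * 10    # same transformation as int(w)*10 in the question

percents = np.arange(10, 101, 10)
raw_pct = np.percentile(raw, percents)
rounded_pct = np.percentile(rounded, percents)

print("distinct raw values:    ", np.unique(raw).size)
print("distinct rounded values:", np.unique(rounded).size)

# Columns: percent, percentile of raw data, percentile of rounded data
for p, a, b in zip(percents, raw_pct, rounded_pct):
    print(int(p), round(float(a), 2), float(b))
```

The raw percentiles are strictly increasing, while the rounded ones can only take values from a small set of multiples of 10, so neighbouring percentiles may coincide.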
I can break down my data set as

value  count
    0     14
   10     72
   20    100
   30    119
   40    152
etc.
Break this up into groups of 100 (since you have 1000 values, and you are looking at 10%, 20%, etc.).
Percent  Group      Values (counts)           np.percentile (largest value in previous column)
-------  ---------  ------------------------  ------------------------------------------------
10       0 - 99     0 (14), 10 (72), 20 (14)  20
20       100 - 199  20 (86), 30 (14)          30
30       200 - 299  30 (100)                  30
40       300 - 399  30 (5), 40 (95)           40
etc.
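The grouping above can be reproduced with a short sketch: sort the values, cut them into ten groups of 100, and compare the last value of each group with np.percentile. The seed and the names ordered, groups and last_in_group are illustrative assumptions. Note also that np.percentile interpolates linearly by default, so "largest value in the group" is an approximation: the result lies between the last value of one group and the first value of the next, and the two agree whenever those values are equal.

```python
import numpy as np

# Arbitrary fixed seed for reproducibility (an assumption).
rng = np.random.default_rng(1)
values = rng.chisquare(6, 1000).astype(int) * 10

ordered = np.sort(values)
groups = ordered.reshape(10, 100)   # row k holds sorted positions 100*k .. 100*k + 99
last_in_group = groups[:, -1]       # largest value in each group

percents = np.arange(10, 101, 10)
pct = np.percentile(values, percents)

# Columns: percent, values (counts) in the group, largest value, np.percentile
for p, row, last, q in zip(percents, groups, last_in_group, pct):
    vals, cnts = np.unique(row, return_counts=True)
    summary = ", ".join(f"{int(v)} ({int(c)})" for v, c in zip(vals, cnts))
    print(f"{int(p):3d}  {summary:40s}  {int(last):4d}  {float(q)}")
```

Whenever one value fills an entire group, as 30 does for positions 200 - 299 in the table, the percentile for that group repeats the previous one.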
Given the distribution that you used, this output seems to be the most likely, but if you rerun the code enough times, you'll encounter different output. I just ran it again and got
10 20.0
20 20.0
30 30.0
40 40.0
50 50.0
60 50.0
70 60.0
80 80.0
90 100.0
100 210.0
Note that both 20.0 and 50.0 are repeated. The counts of the values for this run are:
In [56]: values, counts = np.unique(values, return_counts=True)
In [57]: values
Out[57]:
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120,
130, 140, 150, 160, 170, 180, 190, 210])
In [58]: counts
Out[58]:
array([ 14, 73, 129, 134, 134, 119, 105, 67, 73, 33, 41, 21, 19,
16, 8, 7, 1, 2, 2, 1, 1])
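As a check, the counts above are enough to reconstruct the sorted sample and recompute the percentiles. The sketch below copies the two arrays from the session above; the names uniq, cum_percent and sample are assumptions for illustration. The cumulative percentages show why the repeats occur: the value 20 covers the range from 8.7% to 21.6%, which contains both 10% and 20%, and the value 50 covers 48.4% to 60.3%, which contains both 50% and 60%.

```python
import numpy as np

# Unique values and their counts, copied from the second run above.
uniq = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120,
                 130, 140, 150, 160, 170, 180, 190, 210])
counts = np.array([14, 73, 129, 134, 134, 119, 105, 67, 73, 33, 41, 21, 19,
                   16, 8, 7, 1, 2, 2, 1, 1])

# Share of the sample that is less than or equal to each value.
cum_percent = 100 * np.cumsum(counts) / counts.sum()
for v, c, cp in zip(uniq, counts, cum_percent):
    print(f"{int(v):4d} {int(c):4d} {float(cp):6.1f}%")

# Rebuild the sorted sample from the counts and recompute the percentiles.
sample = np.repeat(uniq, counts)
percents = np.arange(10, 101, 10)
pct = np.percentile(sample, percents)
print(pct)  # same percentiles as in the second run: 20, 20, 30, 40, 50, 50, 60, 80, 100, 210
```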