Why do different percentiles give the same value?

Question

I was trying to calculate 10 percentiles for a list of chi-squared distributed values. I used "chi-squared" because I think this is closest to what our real data looks like.

I am doing this step by step so that I don't miss anything.

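import numpy as np

# Draw 1000 chi-squared samples (6 degrees of freedom), truncate each one
# to an integer and scale by 10, so every value is a multiple of 10.
values = np.array([int(w) * 10 for w in np.random.chisquare(6, 1000)])
print('Min: ', np.min(values))
print('Max: ', np.max(values))
print('Mean: ', np.mean(values))

# Percentiles at 10%, 20%, ..., 100%.
for p in range(10, 101, 10):
    percentile = np.percentile(values, p)
    print('Percent:', p, 'Percentile: ', percentile)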

This is an example output of the code above:

Min:  0
Max:  230
Mean:  55.49
Percent: 10 Percentile:  20.0
Percent: 20 Percentile:  30.0
Percent: 30 Percentile:  30.0
Percent: 40 Percentile:  40.0
Percent: 50 Percentile:  50.0
Percent: 60 Percentile:  60.0
Percent: 70 Percentile:  70.0
Percent: 80 Percentile:  80.0
Percent: 90 Percentile:  100.0
Percent: 100 Percentile:  230.0

The point I'm struggling with is:
why do I get the same "Percentile" for 20 and 30 percent?
I always thought that 20 / 30 means: 20 percent of the values lie below the following value (in this case 30). Likewise, 100 percent of the values lie below 230, which is the maximum.

What am I missing?

Answer 1

Because values was created with the expression int(w)*10, all the values are integer multiples of 10. This means most of the values are repeated many times. For example, I just ran that code and found that the value 30 was repeated 119 times. It turns out that, when you count the values, the interquantile interval 20% - 30% contains only the value 30. That's why the value 30 is repeated in your output.
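To see the effect of ties in isolation, here is a minimal sketch with a made-up ten-element array (not data from the question) in which half of the entries are 30. With NumPy's default linear interpolation, np.percentile reads the sorted array at position (n - 1) * p / 100, so every percentile whose position falls inside the run of 30s comes back as 30.

```python
import numpy as np

# Hypothetical toy data: ten values, five of them tied at 30.
toy = np.array([10, 20, 30, 30, 30, 30, 30, 40, 50, 60])

percents = [30, 40, 50, 60]
# Positions in the sorted array are 2.7, 3.6, 4.5 and 5.4,
# all of which lie between two entries equal to 30.
result = np.percentile(toy, percents)

for p, q in zip(percents, result):
    print(p, q)
```

All four lines print 30.0, even though the percentages differ.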

I can break down my data set as follows. Sort the values and break them up into groups of 100 (since you have 1000 values, and you are looking at 10%, 20%, etc.).

                                                np.percentile
Percent  Group       Values (counts)            (largest value in previous column)
-------  ---------   ------------------------   ----------------------------------
10       0 - 99      0 (14), 10 (72), 20 (16)    20
20       100 - 199   20 (84), 30 (16)            30
30       200 - 299   30 (100)                    30
40       300 - 399   30 (3), 40 (97)             40
etc.
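A rough sketch of how a table like this could be produced. The seed and the variable names are my own choices, so the counts will not match the run shown above. Note also that np.percentile by default interpolates at position (n - 1) * p / 100 rather than taking the last element of each group, so the group maximum is only an approximation that agrees with it whenever the neighbouring sorted values are tied.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
values = np.array([int(w) * 10 for w in rng.chisquare(6, 1000)])

# Sort, then view the 1000 values as 10 consecutive groups of 100.
sorted_values = np.sort(values)
blocks = sorted_values.reshape(10, 100)

for i, block in enumerate(blocks):
    present, counts = np.unique(block, return_counts=True)
    summary = ", ".join(f"{int(v)} ({int(c)})" for v, c in zip(present, counts))
    print(f"{(i + 1) * 10:3d}%  positions {100 * i:3d}-{100 * i + 99:3d}  {summary}")

# Largest value in each group, for comparison with np.percentile.
block_max = blocks.max(axis=1)
print(block_max)
```

Whenever a single value fills an entire group, or the last entries of two neighbouring groups are equal, two consecutive rows end in the same value.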

Given the distribution that you used, this output seems to be the most likely, but if you rerun the code enough times, you'll encounter different output. I just ran it again and got output in which both 20.0 and 50.0 are repeated. The counts of the values for this run are:

In [56]: values, counts = np.unique(values, return_counts=True)                                                             

In [57]: values                                                                                                             
Out[57]: 
array([  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120,
       130, 140, 150, 160, 170, 180, 190, 210])

In [58]: counts                                                                                                             
Out[58]: 
array([ 14,  73, 129, 134, 134, 119, 105,  67,  73,  33,  41,  21,  19,
        16,   8,   7,   1,   2,   2,   1,   1])

Answer 1: 2, accepted, 2020-01-27 16:37:02
