
Why do different percentiles give the same value?

I was trying to calculate 10 percentiles for a list of chi-squared distributed values. I used "chi-squared" because I think it is closest to what our real data looks like.

I tried to do this step by step so as not to miss anything.

import numpy as np

# 1000 chi-squared samples (6 degrees of freedom), truncated to int and scaled by 10
values = np.array([int(w) * 10 for w in np.random.chisquare(6, 1000)])
print('Min: ', np.min(values))
print('Max: ', np.max(values))
print('Mean: ', np.mean(values))

# 10th, 20th, ..., 100th percentiles
for p in range(10, 101, 10):
    percentile = np.percentile(values, p)
    print('Percent:', p, 'Percentile: ', percentile)

This is an example output of the code above:

Min:  0
Max:  230
Mean:  55.49
Percent: 10 Percentile:  20.0
Percent: 20 Percentile:  30.0
Percent: 30 Percentile:  30.0
Percent: 40 Percentile:  40.0
Percent: 50 Percentile:  50.0
Percent: 60 Percentile:  60.0
Percent: 70 Percentile:  70.0
Percent: 80 Percentile:  80.0
Percent: 90 Percentile:  100.0
Percent: 100 Percentile:  230.0

The point I'm struggling with is:
why do I get the same "Percentile" for 20 and 30 percent?
I always thought that the pair 20 / 30 means: 20 percent of the values lie below the reported value (in this case 30), just as 100 percent of the values lie at or below 230, which is the maximum.

Which idea am I missing?

Because values was created with the expression int(w)*10, all the values are integer multiples of 10. This means most of the values are repeated many times. For example, I just ran that code and found that the value 30 was repeated 119 times. It turns out that, when you count the values, the interquantile interval 20% - 30% contains only the value 30. That's why the value 30 is repeated in your output.
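To see that the ties come from the rounding rather than from the chi-squared distribution itself, here is a minimal sketch that compares the percentiles of the raw samples with those of the quantized ones. The seed and the variable names (raw, quantized) are my own choices for illustration and are not part of the question's code.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, only for reproducibility
raw = rng.chisquare(6, 1000)         # continuous samples, essentially all distinct
quantized = raw.astype(int) * 10     # same effect as int(w)*10 for positive w

percents = np.arange(10, 101, 10)
raw_pct = np.percentile(raw, percents)
quantized_pct = np.percentile(quantized, percents)

print("distinct raw values:      ", np.unique(raw).size)
print("distinct quantized values:", np.unique(quantized).size)
for p, r, q in zip(percents, raw_pct, quantized_pct):
    print(p, round(r, 2), q)
```

The raw percentiles increase strictly, while the quantized data has only a few dozen distinct values spread over 1000 entries, so neighbouring percentiles can easily land on the same value.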

I can break down my data set as

   value    #
     0     14
    10     72
    20    100
    30    119
    40    152
    etc.

Sort the values and break them up into groups of 100 (since you have 1000 values, and you are looking at 10%, 20%, etc.).

                                                np.percentile
Percent  Group       Values (counts)            (largest value in previous column)
-------  ---------   ------------------------   ----------------------------------
10       0 - 99      0 (14), 10 (72), 20 (14)    20
20       100 - 199   20 (86), 30 (14)            30
30       200 - 299   30 (100)                    30
40       300 - 399   30 (5), 40 (95)             40
etc.
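The grouping can also be done with cumulative counts instead of by hand. The following is a rough sketch using only the five counts listed in the breakdown above (the rest of that run is elided, so only the first four percentiles can be checked). The helper value_at_position is a hypothetical name introduced here, not a NumPy function.

```python
import numpy as np

# Counts copied from the breakdown above; later values of that run are not shown.
unique_values = np.array([0, 10, 20, 30, 40])
counts = np.array([14, 72, 100, 119, 152])

# cumulative[i] = how many sorted entries are <= unique_values[i]
cumulative = np.cumsum(counts)

def value_at_position(pos):
    """Value at 0-based index pos of the sorted data (hypothetical helper)."""
    return unique_values[np.searchsorted(cumulative, pos, side="right")]

# Last position of each group of 100 sorted entries
for percent, last_pos in [(10, 99), (20, 199), (30, 299), (40, 399)]:
    print(percent, value_at_position(last_pos))
```

Positions 199 and 299 both fall inside the run of 30s (sorted positions 186 through 304), which is why the 20% and 30% rows report the same value.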

Given the distribution that you used, this output seems to be the most likely, but if you rerun the code enough times, you'll encounter different output. I just ran it again and got

10 20.0
20 20.0
30 30.0
40 40.0
50 50.0
60 50.0
70 60.0
80 80.0
90 100.0
100 210.0

Note that both 20.0 and 50.0 are repeated. The counts of the values for this run are:

In [56]: values, counts = np.unique(values, return_counts=True)                                                             

In [57]: values                                                                                                             
Out[57]: 
array([  0,  10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120,
       130, 140, 150, 160, 170, 180, 190, 210])

In [58]: counts                                                                                                             
Out[58]: 
array([ 14,  73, 129, 134, 134, 119, 105,  67,  73,  33,  41,  21,  19,
        16,   8,   7,   1,   2,   2,   1,   1])
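As a check, the second run can be rebuilt from these counts and passed to np.percentile again. This is a small sketch: the names unique_values, data, and percentiles are mine, and np.repeat is used only to expand the counts back into a sorted array of 1000 entries.

```python
import numpy as np

# Unique values and counts copied from the session output above
unique_values = np.array([0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120,
                          130, 140, 150, 160, 170, 180, 190, 210])
counts = np.array([14, 73, 129, 134, 134, 119, 105, 67, 73, 33, 41, 21, 19,
                   16, 8, 7, 1, 2, 2, 1, 1])

# Expand back into the full sorted data set (1000 entries)
data = np.repeat(unique_values, counts)

percents = np.arange(10, 101, 10)
percentiles = np.percentile(data, percents)
for p, value in zip(percents, percentiles):
    print(p, value)
```

The value 20 occupies sorted positions 87 through 215 and the value 50 occupies positions 484 through 602, so each of them covers two consecutive percentile boundaries.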

