
How do I calculate percentiles with Python/NumPy?

Is there a convenient way to calculate percentiles for a sequence or a one-dimensional NumPy array?

I am looking for something similar to Excel's percentile function.

I looked in NumPy's statistics reference and couldn't find this. All I could find was the median (the 50th percentile), but no function for arbitrary percentiles.

You might be interested in the SciPy Stats package. It has the percentile function you're after and many other statistical goodies.

percentile() is available in NumPy too.

import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50) # 50th percentile, i.e. the median
print(p)
# 3.0
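
As a small extension of the snippet above, np.percentile also accepts a sequence of percentiles and returns one value per entry. A minimal sketch (the variable name quartiles is only illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Passing a list of percentiles (on the 0-100 scale) returns an array
# with one result per requested percentile.
quartiles = np.percentile(a, [25, 50, 75])
print(quartiles)  # [2. 3. 4.]
```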

This ticket leads me to believe they won't be integrating percentile() into NumPy anytime soon.

By the way, there is a pure-Python implementation of the percentile function, in case one doesn't want to depend on SciPy. The function is copied below:

## {{{ http://code.activestate.com/recipes/511478/ (r1)
import math
import functools

def percentile(N, percent, key=lambda x:x):
    """
    Find the percentile of a list of values.

    @parameter N - is a list of values. Note N MUST BE already sorted.
    @parameter percent - a float value from 0.0 to 1.0.
    @parameter key - optional key function to compute value from each element of N.

    @return - the percentile of the values
    """
    if not N:
        return None
    k = (len(N)-1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    d0 = key(N[int(f)]) * (c-k)
    d1 = key(N[int(c)]) * (k-f)
    return d0+d1

# median is 50th percentile.
median = functools.partial(percentile, percent=0.5)
## end of http://code.activestate.com/recipes/511478/ }}}

import numpy as np
a = [154, 400, 1124, 82, 94, 108]
print(np.percentile(a, 95)) # gives the 95th percentile
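
The recipe above expects pre-sorted input and a fraction from 0.0 to 1.0, while np.percentile sorts for you and takes a value from 0 to 100. The sketch below is a hypothetical wrapper (percentile_linear is not part of the recipe) that uses the same linear interpolation with NumPy's conventions; its result should agree with np.percentile's default up to floating-point rounding.

```python
import math

def percentile_linear(values, q):
    """Linear-interpolation percentile; q is on the 0-100 scale."""
    if not values:
        return None
    ordered = sorted(values)          # the recipe expects pre-sorted input
    k = (len(ordered) - 1) * q / 100  # fractional index into the sorted data
    lower = math.floor(k)
    upper = math.ceil(k)
    if lower == upper:
        return float(ordered[lower])
    # Weight the two neighbouring values by their distance from k.
    return ordered[lower] * (upper - k) + ordered[upper] * (k - lower)

a = [154, 400, 1124, 82, 94, 108]
print(percentile_linear(a, 95))  # 943.0
```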

Here's how to calculate the percentile without NumPy, using only Python.

import math

def percentile(data, perc: int):
    size = len(data)
    return sorted(data)[int(math.ceil((size * perc) / 100)) - 1]

percentile([10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0], 90)
# 9.0
percentile([142, 232, 290, 120, 274, 123, 146, 113, 272, 119, 124, 277, 207], 50)
# 146
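
One edge case in the function above: with perc = 0 the index becomes -1, so it returns the largest value rather than the smallest. A hedged variant of the same nearest-rank idea, with the rank clamped and the input validated (the name percentile_nearest_rank and the error messages are my own):

```python
import math

def percentile_nearest_rank(data, perc):
    """Nearest-rank percentile; perc is on the 0-100 scale."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= perc <= 100:
        raise ValueError("perc must be between 0 and 100")
    ordered = sorted(data)
    # Rank is ceil(perc/100 * size); clamp to 1 so perc=0 gives the minimum
    # instead of wrapping around to index -1 (the maximum).
    rank = max(1, math.ceil(len(ordered) * perc / 100))
    return ordered[rank - 1]

print(percentile_nearest_rank([10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0], 90))  # 9.0
print(percentile_nearest_rank([3, 1, 2], 0))  # 1
```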

Starting with Python 3.8, the standard library comes with the quantiles function as part of the statistics module:

from statistics import quantiles

quantiles([1, 2, 3, 4, 5], n=100)
# [0.06, 0.12, 0.18, 0.24, 0.3, 0.36, 0.42, 0.48, 0.54, 0.6, 0.66, 0.72, 0.78, 0.84, 0.9, 0.96, 1.02, 1.08, 1.14, 1.2, 1.26, 1.32, 1.38, 1.44, 1.5, 1.56, 1.62, 1.68, 1.74, 1.8, 1.86, 1.92, 1.98, 2.04, 2.1, 2.16, 2.22, 2.28, 2.34, 2.4, 2.46, 2.52, 2.58, 2.64, 2.7, 2.76, 2.82, 2.88, 2.94, 3.0, 3.06, 3.12, 3.18, 3.24, 3.3, 3.36, 3.42, 3.48, 3.54, 3.6, 3.66, 3.72, 3.78, 3.84, 3.9, 3.96, 4.02, 4.08, 4.14, 4.2, 4.26, 4.32, 4.38, 4.44, 4.5, 4.56, 4.62, 4.68, 4.74, 4.8, 4.86, 4.92, 4.98, 5.04, 5.1, 5.16, 5.22, 5.28, 5.34, 5.4, 5.46, 5.52, 5.58, 5.64, 5.7, 5.76, 5.82, 5.88, 5.94]
quantiles([1, 2, 3, 4, 5], n=100)[49] # 50th percentile (i.e. the median)
# 3.0

For a given data set data, quantiles returns a list of n - 1 cut points separating the n quantile intervals (it divides data into n continuous intervals with equal probability):

statistics.quantiles(data, *, n=4, method='exclusive')

where n, in our case (percentiles), is 100.
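
The method argument matters for small samples. The sketch below wraps quantiles in a hypothetical helper (percentile_from_quantiles is my own name) and compares the default 'exclusive' method with 'inclusive', which treats the data as containing the minimum and maximum of the population and should match np.percentile's default linear interpolation.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def percentile_from_quantiles(values, p, method="exclusive"):
    """Return the p-th percentile, where p is an integer from 1 to 99."""
    # quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th.
    return quantiles(values, n=100, method=method)[p - 1]

p90_exclusive = percentile_from_quantiles(data, 90)
p90_inclusive = percentile_from_quantiles(data, 90, method="inclusive")
print(p90_exclusive, p90_inclusive)  # 9.9 9.1
```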

The definition of percentile I usually see expects the result to be the value from the supplied list below which P percent of values are found. That means the result must be an element of the set, not an interpolation between set elements. To get that, you can use a simpler function.

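import math

def percentile(N, P):
    """
    Find the percentile of a list of values

    @parameter N - A list of values.  N must be sorted.
    @parameter P - A float value from 0.0 to 1.0

    @return - The percentile of the values.
    """
    # Round half up explicitly: Python 3's round() rounds halves to the
    # nearest even integer, which would change the results shown below.
    n = int(math.floor(P * len(N) + 0.5 + 0.5))
    # Clamp so that P = 1.0 returns the last element instead of raising IndexError.
    n = min(n, len(N))
    return N[n-1]

# A = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# B = (15, 20, 35, 40, 50)
#
# print(percentile(A, P=0.3))
# 4
# print(percentile(A, P=0.8))
# 9
# print(percentile(B, P=0.3))
# 20
# print(percentile(B, P=0.8))
# 50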

If you would rather get the value from the supplied list at or below which P percent of values are found, then use this simple modification:

import math

def percentile(N, P):
    # Round half up explicitly; Python 3's round() rounds halves to even.
    n = int(math.floor(P * len(N) + 0.5 + 0.5))
    if n > 1:
        return N[n-2]
    else:
        return N[0]

Or with the simplification suggested by @ijustlovemath:

import math

def percentile(N, P):
    # Round half up explicitly; Python 3's round() rounds halves to even.
    n = max(int(math.floor(P * len(N) + 0.5 + 0.5)), 2)
    return N[n-2]

Check the scipy.stats module:

 scipy.stats.scoreatpercentile

To calculate the percentile rank of each value in a series, run:

from scipy.stats import rankdata
import numpy as np

def calc_percentile(a, method='min'):
    # rank every value, then divide by the number of values
    a = np.asarray(a)
    return rankdata(a, method=method) / float(len(a))

For example:

a = range(20)
print({val: round(float(pct), 3) for val, pct in zip(a, calc_percentile(a))})
# {0: 0.05, 1: 0.1, 2: 0.15, 3: 0.2, 4: 0.25, 5: 0.3, 6: 0.35, 7: 0.4, 8: 0.45, 9: 0.5, 10: 0.55, 11: 0.6, 12: 0.65, 13: 0.7, 14: 0.75, 15: 0.8, 16: 0.85, 17: 0.9, 18: 0.95, 19: 1.0}

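The same rank-based calculation can be sketched without SciPy. This is an illustrative pure-Python version (percentile_ranks is a hypothetical name) that mirrors the 'min' ranking method, where tied values all share the lowest rank of their group:

```python
from bisect import bisect_left

def percentile_ranks(values):
    """Min-rank of each value divided by the number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_left counts the values strictly smaller than v; adding 1
    # gives the 'min' rank, so ties share the lowest rank of their group.
    return [(bisect_left(ordered, v) + 1) / n for v in values]

print(percentile_ranks([10, 20, 20, 40]))  # [0.25, 0.5, 0.5, 1.0]
```
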
A convenient way to calculate percentiles for a one-dimensional NumPy array or a matrix is numpy.percentile < https://docs.scipy.org/doc/numpy/reference/generated/numpy.percentile.html >. Example:

import numpy as np

a = np.array([0,1,2,3,4,5,6,7,8,9,10])
p50 = np.percentile(a, 50) # return 50th percentile, i.e. the median.
p90 = np.percentile(a, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.0  and p90 =  9.0

However, if there is any NaN value in your data, the above function returns nan and is therefore not useful. The recommended function in that case is numpy.nanpercentile < https://docs.scipy.org/doc/numpy/reference/generated/numpy.nanpercentile.html >:

import numpy as np

a_NaN = np.array([0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.])
a_NaN[0] = np.nan
print('a_NaN',a_NaN)
p50 = np.nanpercentile(a_NaN, 50) # return 50th percentile, i.e. the median.
p90 = np.nanpercentile(a_NaN, 90) # return 90th percentile.
print('median = ',p50,' and p90 = ',p90) # median =  5.5  and p90 =  9.1
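
To make the difference concrete, here is a short sketch on a smaller array (a_nan is an illustrative name) comparing the two functions and the manual alternative of filtering out NaN first:

```python
import numpy as np

a_nan = np.array([np.nan, 1.0, 2.0, 3.0, 4.0])

# np.percentile propagates NaN: a single missing value makes the result nan.
plain = np.percentile(a_nan, 50)
print(plain)  # nan

# np.nanpercentile ignores NaN entries and works on [1, 2, 3, 4].
ignoring_nan = np.nanpercentile(a_nan, 50)
print(ignoring_nan)  # 2.5

# Dropping the NaN values by hand gives the same result.
filtered = np.percentile(a_nan[~np.isnan(a_nan)], 50)
print(filtered)  # 2.5
```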

In both functions presented above you can also choose the interpolation mode. The examples below show the effect of each mode. Note that since NumPy 1.22 the keyword is called method, and interpolation is deprecated.

import numpy as np

b = np.array([1,2,3,4,5,6,7,8,9,10])
print('percentiles using default interpolation')
p10 = np.percentile(b, 10) # return 10th percentile.
p50 = np.percentile(b, 50) # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90) # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "linear")
p10 = np.percentile(b, 10,interpolation='linear') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='linear') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='linear') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.9 , median =  5.5  and p90 =  9.1

print('percentiles using interpolation = ', "lower")
p10 = np.percentile(b, 10,interpolation='lower') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='lower') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='lower') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1 , median =  5  and p90 =  9

print('percentiles using interpolation = ', "higher")
p10 = np.percentile(b, 10,interpolation='higher') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='higher') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='higher') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  6  and p90 =  10

print('percentiles using interpolation = ', "midpoint")
p10 = np.percentile(b, 10,interpolation='midpoint') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='midpoint') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='midpoint') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  1.5 , median =  5.5  and p90 =  9.5

print('percentiles using interpolation = ', "nearest")
p10 = np.percentile(b, 10,interpolation='nearest') # return 10th percentile.
p50 = np.percentile(b, 50,interpolation='nearest') # return 50th percentile, i.e. the median.
p90 = np.percentile(b, 90,interpolation='nearest') # return 90th percentile.
print('p10 = ',p10,', median = ',p50,' and p90 = ',p90)
#p10 =  2 , median =  5  and p90 =  9
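
The same comparison can be written more compactly with the newer keyword. This sketch assumes NumPy 1.22 or later, where the argument is spelled method; on older versions the equivalent keyword is interpolation, as in the examples above.

```python
import numpy as np

b = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Assumes NumPy >= 1.22, where the keyword is called `method`.
results = {}
for method in ("linear", "lower", "higher", "midpoint", "nearest"):
    # Request the 10th, 50th and 90th percentiles in one call.
    results[method] = np.percentile(b, [10, 50, 90], method=method)
    print(method, results[method])
```

Only 'lower', 'higher' and 'nearest' are guaranteed to return values that are actual elements of b.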

If your input array consists only of integer values, you might want the percentile as an integer. If so, choose an interpolation mode such as 'lower', 'higher', or 'nearest'.

In case you need the answer to be a member of the input NumPy array:

Just to add that the percentile function in NumPy by default calculates the output as a linear weighted average of the two neighboring entries in the input vector. In some cases people may want the returned percentile to be an actual element of the vector. In that case, from v1.9.0 onwards you can use the "interpolation" option (renamed to "method" in NumPy 1.22) with either "lower", "higher" or "nearest".

import numpy as np
x = np.random.uniform(10, size=1000) - 5.0

np.percentile(x, 70) # 70th percentile
# 2.075966046220879 (the exact value changes from run to run)

np.percentile(x, 70, interpolation="nearest")
# 2.0729677997904314

The latter is an actual entry in the vector, while the former is a linear interpolation of the two vector entries that border the percentile.

For a pandas Series, use the describe function.

Suppose you have a DataFrame df with the columns sales and id. To calculate percentiles for sales, it works like this:

df['sales'].describe(percentiles = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1])

0.0: minimum
1: maximum
0.1: 10th percentile, and so on
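
A self-contained sketch of the describe approach follows. The DataFrame contents are made up for illustration; only the column names sales and id come from the description above. Series.quantile is shown as an alternative when only the percentile values are needed.

```python
import pandas as pd

# Hypothetical data; the column names follow the description above.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "sales": [10, 20, 30, 40, 50]})

# describe() reports count, mean, std, min, max and the requested percentiles.
summary = df["sales"].describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# Series.quantile returns only the requested percentiles (fractions 0-1).
selected = df["sales"].quantile([0.1, 0.5, 0.9])
print(selected)
```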

I bootstrapped the data and then plotted the confidence interval for 10 samples. The confidence interval is the range between a lower and an upper percentile of the bootstrap replicates, for example the 5th and the 95th.

import matplotlib.pyplot as plt
import numpy as np
import dc_stat_think as dcst

data = [154, 400, 1124, 82, 94, 108]

# 60 bootstrap replicates of the mean, split below into 10 groups of 6
bs_data = dcst.draw_bs_reps(data, np.mean, size=6*10)

x = np.linspace(1, 6, 6)
print(x)
for line_data in bs_data.reshape((10, 6)):
    # np.percentile expects percentiles on a 0-100 scale,
    # so a 95% interval is [2.5, 97.5], not [.025, .975]
    ci = np.percentile(line_data, [2.5, 97.5])
    mean_avg = np.mean(line_data)
    fig, ax = plt.subplots()
    ax.plot(x, line_data)
    # shade the band between the lower and the upper percentile
    ax.fill_between(x, ci[0], ci[1], color='b', alpha=.1)
    ax.axhline(mean_avg, color='red')
    plt.show()
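
If dc_stat_think is not available, the resampling step can be sketched with NumPy alone. This is not the package's implementation, just a minimal stand-in that draws resamples with rng.choice; the seed and the number of replicates (n_reps) are arbitrary choices.

```python
import numpy as np

data = np.array([154, 400, 1124, 82, 94, 108])

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_reps = 10_000                 # number of bootstrap replicates (arbitrary)

# Resample with replacement and record the mean of each resample.
samples = rng.choice(data, size=(n_reps, data.size), replace=True)
bs_means = samples.mean(axis=1)

# np.percentile expects percentiles on the 0-100 scale.
ci_low, ci_high = np.percentile(bs_means, [2.5, 97.5])
print(ci_low, ci_high)
```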

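Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you republish them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.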

粤ICP备18138465号  © 2020-2024 STACKOOM.COM