Operating on histogram bins in Python
I am trying to find the median of the values within each bin range generated by the np.histogram function. How would I select only the values within a bin range and operate on those specific values? Below is an example of my data and what I am trying to do:
x = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]
The y values can have any x value associated with them, for example:
hist, bins = np.histogram(x)
hist = [129, 126, 94, 133, 179, 206, 142, 147, 90, 185]
bins = [0., 0.09999926, 0.19999853, 0.29999779, 0.39999706,
0.49999632, 0.59999559, 0.69999485, 0.79999412, 0.8999933,
0.99999265]
So, I am trying to find the median y value of the 129 values in the first bin, and likewise for each of the other bins.
One way is with pandas.cut():
>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)
>>> x = np.random.randint(0, 25, size=100)
>>> _, bins = np.histogram(x)
>>> pd.Series(x).groupby(pd.cut(x, bins)).median()
(0.0, 2.4] 2.0
(2.4, 4.8] 3.0
(4.8, 7.2] 6.0
(7.2, 9.6] 8.5
(9.6, 12.0] 10.5
(12.0, 14.4] 13.0
(14.4, 16.8] 15.5
(16.8, 19.2] 18.0
(19.2, 21.6] 20.5
(21.6, 24.0] 23.0
dtype: float64
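Note that pandas.cut() builds intervals that are open on the left by default, so values equal to the first bin edge are not assigned to any bin. The sketch below, which uses made-up sample data (the names x, y and rng are assumptions, not part of the question), passes include_lowest=True to keep those values and groups a separate y array by the bins of x:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the asker's x and y
rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.normal(size=1000)

_, bins = np.histogram(x)

# include_lowest=True keeps values equal to the first edge, which the
# default left-open intervals would otherwise label as NaN
labels = pd.cut(x, bins, include_lowest=True)

# Median of y within each bin of x
y_medians = pd.Series(y).groupby(labels, observed=False).median()
print(y_medians)
```

The intervals here are closed on the right, whereas np.histogram uses bins that are closed on the left, so a value lying exactly on an interior edge would land in a different bin. For continuous data this rarely matters.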
If you want to stay in NumPy, you might want to check out np.digitize().
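A minimal sketch of the np.digitize() route, assuming made-up x and y arrays (the data and variable names are illustrative only). np.digitize() returns 1-based labels, and the maximum value falls one past the last bin, so it is clipped back to match how np.histogram counts it:

```python
import numpy as np

# Hypothetical sample data standing in for the asker's x and y
rng = np.random.default_rng(1)
x = rng.random(1000)
y = rng.normal(size=1000)

hist, bins = np.histogram(x)

# np.digitize returns label i when bins[i-1] <= x < bins[i].
# The maximum equals the last edge and gets label len(bins), so clip it
# into the last bin, which np.histogram treats as closed on both sides.
labels = np.clip(np.digitize(x, bins), 1, len(bins) - 1)

# Median of y within each bin of x (labels run from 1 to len(bins) - 1)
y_medians = [np.median(y[labels == k]) for k in range(1, len(bins))]
```

Counting the labels with np.bincount(labels)[1:] should reproduce hist, which is a quick way to check that the bin assignment agrees with np.histogram.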
You can do this by slicing a sorted version of your data, using the cumulative bin counts as indices:
import numpy as np
x = np.random.rand(1000)
hist, bins = np.histogram(x)
# if you don't mind sorting your original data, use x.sort() instead
xsorted = np.sort(x)
# cumulative counts mark where each bin starts and ends in the sorted data
ix = [0] + hist.cumsum().tolist()
[np.median(xsorted[i:j]) for i, j in zip(ix[:-1], ix[1:])]
which will output the medians as a standard Python list.
np.digitize and np.searchsorted will match your data with bins. The latter is preferable in this situation because it does fewer unnecessary checks (your bins can safely be assumed to be sorted).
If you look at the documentation of np.histogram (Notes section), you will notice that all bins are half-open on the right, except the last one, which is closed. This means that you can do the following:
x = np.abs(np.random.normal(loc=0.75, scale=0.75, size=10000))
h, b = np.histogram(x)
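# side='right' gives 1-based positions, so subtract 1 for 0-based bin labels;
# clip so the maximum (equal to the last edge) falls in the last bin
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, b.size - 2)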
Now ind contains a label for each number indicating which bin it belongs to. You can compute medians:
m = [np.median(x[ind == label]) for label in range(b.size - 1)]
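If some bins are empty, np.median on an empty selection returns nan and emits a RuntimeWarning. The sketch below is one possible way to guard against that, using a small made-up array chosen so that most bins are empty; it computes its own 0-based labels so that it stands alone:

```python
import numpy as np

# Hypothetical data: values cluster at both ends, leaving the middle bins empty
x = np.array([0.02, 0.05, 0.07, 0.93, 0.98])
h, b = np.histogram(x)

# 0-based bin labels; the maximum is clipped into the last (closed) bin
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, b.size - 2)

# Use nan explicitly for empty bins instead of calling np.median on nothing
m = [np.median(x[ind == k]) if h[k] > 0 else np.nan
     for k in range(b.size - 1)]
```

Checking the count in h avoids a second mask comparison per bin and keeps the warning out of the output.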
If you are able to sort the input data, your job becomes easier because you can use views instead of extracting the data for each bin using masking.
np.split is a good choice in this case:
x.sort()
sections = np.split(x, np.cumsum(h[:-1]))
m = [np.median(arr) for arr in sections]
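The question asks for the median of y within bins of x, and the same splitting idea can be adapted to that case by reordering y according to the sort order of x. This is a sketch under the assumption that x and y are equal-length arrays; the sample data is made up:

```python
import numpy as np

# Hypothetical sample data standing in for the asker's x and y
rng = np.random.default_rng(2)
x = rng.random(1000)
y = rng.normal(size=1000)

h, b = np.histogram(x)

order = np.argsort(x)   # permutation that sorts x
y_by_x = y[order]       # y values reordered to follow sorted x

# Cumulative counts (all but the last) mark where each bin ends
sections = np.split(y_by_x, np.cumsum(h[:-1]))
y_medians = [np.median(s) for s in sections]
```

Using np.argsort leaves the original x and y untouched, at the cost of one extra copy of y.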