
Operating on histogram bins in Python

I am trying to find the median of the values within each bin range generated by the np.histogram function. How would I select only the values within a bin range and operate on those specific values? Below is an example of my data and what I am trying to do:

x = [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]

y values can have any sort of x value associated with them, for example:

hist, bins = np.histogram(x)
hist = [129, 126, 94, 133, 179, 206, 142, 147, 90, 185] 
bins = [0.,         0.09999926, 0.19999853, 0.29999779, 0.39999706,
        0.49999632, 0.59999559, 0.69999485, 0.79999412, 0.8999933,
        0.99999265]

So, I am trying to find the median y value of the 129 values in the first bin generated, and likewise for the other bins.

One way is with pandas.cut():

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(444)

>>> x = np.random.randint(0, 25, size=100)
>>> _, bins = np.histogram(x)
>>> pd.Series(x).groupby(pd.cut(x, bins)).median()
(0.0, 2.4]       2.0
(2.4, 4.8]       3.0
(4.8, 7.2]       6.0
(7.2, 9.6]       8.5
(9.6, 12.0]     10.5
(12.0, 14.4]    13.0
(14.4, 16.8]    15.5
(16.8, 19.2]    18.0
(19.2, 21.6]    20.5
(21.6, 24.0]    23.0
dtype: float64
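
One caveat about the intervals in that output: by default pandas.cut() builds intervals that are closed on the right, (a, b], so a value equal to the lowest edge falls outside every interval. A minimal sketch of the effect follows, using a small made-up array (the data is an assumption for illustration, not from the question). Passing include_lowest=True pulls the lowest edge into the first interval. Values that sit exactly on an interior edge can still be assigned differently than np.histogram assigns them, since np.histogram treats bins as [a, b).

```python
import numpy as np
import pandas as pd

# Made-up data for illustration; two values sit exactly on the lowest edge (0).
x = np.array([0, 0, 1, 2, 3, 5, 8, 13, 21, 24])
_, bins = np.histogram(x)

# Default: intervals are (a, b], so x == bins[0] is not assigned to any interval.
default_cut = pd.cut(x, bins)
# include_lowest=True makes the first interval contain bins[0] as well.
closed_cut = pd.cut(x, bins, include_lowest=True)

n_dropped_default = int(pd.isna(default_cut).sum())
n_dropped_closed = int(pd.isna(closed_cut).sum())
print(n_dropped_default, n_dropped_closed)
```

Unassigned values are silently dropped by groupby, so the first group's median can be off if the minimum occurs in the data.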

If you want to stay in NumPy, you might want to check out np.digitize().
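
A rough sketch of how np.digitize() could be used here, assuming the edges come straight from np.histogram and random integer data like the pandas example (the seed and variable names are arbitrary choices, not part of the answer). np.digitize returns 1-based labels for in-range values, and the maximum lands one past the last bin, so the labels are shifted and capped before use.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility only
x = rng.integers(0, 25, size=100)
counts, bins = np.histogram(x)

# digitize gives i such that bins[i-1] <= x < bins[i]; shift to 0-based labels.
labels = np.digitize(x, bins) - 1
# The maximum equals bins[-1] and would get label len(bins) - 1; fold it into the last bin.
labels = np.minimum(labels, len(bins) - 2)

# Median per bin; empty bins get NaN instead of triggering a warning.
medians = [np.median(x[labels == k]) if counts[k] else np.nan
           for k in range(len(bins) - 1)]
print(medians)
```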

You can do this by slicing a sorted version of your data, using the cumulative bin counts as indices:

import numpy as np

x = np.random.rand(1000)
hist, bins = np.histogram(x)

# if you don't mind sorting your original data in place, use x.sort() instead
xsorted = np.sort(x)
# slice boundaries into the sorted data: 0 followed by the running totals of the counts
ix = [0] + hist.cumsum().tolist()
[np.median(xsorted[i:j]) for i, j in zip(ix[:-1], ix[1:])]

which will output the medians as a standard Python list.

np.digitize and np.searchsorted will match your data with bins. The latter is preferable in this situation because it skips checks that are unnecessary here (your bins can safely be assumed to be sorted).

If you look at the documentation of np.histogram (Notes section), you will notice that the bins are all half-open on the right, i.e. [a, b), except the last one, which also includes its right edge. This means that you can do the following:

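import numpy as np

x = np.abs(np.random.normal(loc=0.75, scale=0.75, size=10000))
h, b = np.histogram(x)
# side='right' yields 1-based labels; subtract 1 and clip so the maximum lands in the last bin
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, b.size - 2)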

Now ind contains a label for each number indicating which bin it belongs to. You can compute medians:

m = [np.median(x[ind == label]) for label in range(b.size - 1)]
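
The question asks about y values grouped by the bins of x, so the same labels can index a second array. Below is a minimal sketch under that assumption; the y array is hypothetical and generated only for illustration, and the labels are made 0-based and clipped so that the maximum of x falls in the last bin.

```python
import numpy as np

rng = np.random.default_rng(1)             # arbitrary seed
x = rng.random(1000)                       # values that define the bins
y = 2.0 * x + rng.normal(size=x.size)      # hypothetical y paired with each x

h, b = np.histogram(x)
# 0-based bin label for every x; the maximum (x == b[-1]) is folded into the last bin.
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, b.size - 2)

# Median of the y values whose x falls in each bin; NaN for empty bins.
y_medians = [np.median(y[ind == k]) if h[k] > 0 else np.nan
             for k in range(b.size - 1)]
print(y_medians)
```

Each mask scans the whole array, so this costs roughly one pass over the data per bin, which is usually fine for a handful of bins.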

If you are able to sort the input data, your job becomes easier because you can use views instead of extracting the data for each bin using masking. np.split is a good choice in this case:

x.sort()
sections = np.split(x, np.cumsum(h[:-1]))
m = [np.median(arr) for arr in sections]
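
If a separate y array is paired with x, the sort-and-split idea can be adapted by reordering y with np.argsort(x) rather than sorting x alone. This is only a sketch under that assumption: the y data is hypothetical, and the final check against the masking approach is included to show that the two agree.

```python
import numpy as np

rng = np.random.default_rng(2)                      # arbitrary seed
x = rng.random(1000)
y = x ** 2 + rng.normal(scale=0.1, size=x.size)     # hypothetical y paired with each x

h, b = np.histogram(x)

# Reorder y by ascending x, then cut it at the cumulative bin counts.
order = np.argsort(x, kind="stable")
sections = np.split(y[order], np.cumsum(h[:-1]))
y_medians = [np.median(s) if s.size else np.nan for s in sections]

# Cross-check against the masking approach with 0-based, clipped labels.
ind = np.clip(np.searchsorted(b, x, side='right') - 1, 0, b.size - 2)
masked = [np.median(y[ind == k]) if h[k] else np.nan for k in range(h.size)]
print(np.allclose(y_medians, masked, equal_nan=True))
```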
