
How can I bin a Pandas Series setting the bin size to a preset value of max/min for each bin

I have a pd.Series of floats and I would like to bin it into n bins, where the bin size for each bin is set so that max/min is a preset value (e.g. 1.20).

The requirement means that the size of the bins is not constant. For example:

data = pd.Series(np.arange(1, 11.0))
print(data)

0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9    10.0
dtype: float64

I would like the bin sizes to be:

1.00 <= bin 1 < 1.20
1.20 <= bin 2 < 1.20 x 1.20 = 1.44
1.44 <= bin 3 < 1.44 x 1.20 = 1.73
...

etc.
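For illustration, the edges I am after form a geometric sequence with ratio 1.20 starting at the series minimum. A rough sketch of what I mean (not a requirement on the implementation):

import numpy as np
import pandas as pd

data = pd.Series(np.arange(1, 11.0))
ratio = 1.20

# Grow the edges geometrically from the minimum until the maximum is covered
edges = [data.min()]
while edges[-1] <= data.max():
    edges.append(edges[-1] * ratio)

print(edges)  # 1.0, 1.2, 1.44, 1.728, ... up to the first edge above 10.0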

Thanks

Here's one with pd.cut, where the bins can be computed by taking the np.cumprod of an array filled with 1.2:

import numpy as np
import pandas as pd

data = pd.Series(list(range(11)))

n = 20  # set according to the range of the actual data
bins = np.r_[0, np.cumprod(np.full(n, 1.2))]
# array([ 0.        ,  1.2       ,  1.44      ,  1.728 ...
pd.cut(data, bins)

0                 NaN
1          (0.0, 1.2]
2      (1.728, 2.074]
3      (2.986, 3.583]
4        (3.583, 4.3]
5         (4.3, 5.16]
6       (5.16, 6.192]
7       (6.192, 7.43]
8       (7.43, 8.916]
9     (8.916, 10.699]
10    (8.916, 10.699]
dtype: category

Where bins in this case goes up to:

np.r_[0,np.cumprod(np.full(20, 1.2))]

array([ 0.        ,  1.2       ,  1.44      ,  1.728     ,  2.0736    ,
        2.48832   ,  2.985984  ,  3.5831808 ,  4.29981696,  5.15978035,
        6.19173642,  7.43008371,  8.91610045, 10.69932054, 12.83918465,
       15.40702157, 18.48842589, 22.18611107, 26.62333328, 31.94799994,
       38.33759992])

So you'll have to set that according to the range of values of the actual data.
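As a rough sketch of how n could be derived automatically from the data's range (my assumption, not part of the original answer; the names are illustrative):

import numpy as np
import pandas as pd

data = pd.Series(np.arange(1, 11.0))
ratio = 1.2

# Smallest n with data.min() * ratio**n >= data.max()
n = int(np.ceil(np.log(data.max() / data.min()) / np.log(ratio)))
bins = data.min() * ratio ** np.arange(n + 1)
# Pass include_lowest=True to pd.cut so data.min() itself lands in the first bin
print(bins)  # last edge (~10.7) is >= data.max(), so every value is covered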

This is, I believe, the best way to do it because you are considering the max and min values from your array. Therefore you won't need to worry about what values you are using, only the multiplier or step_size for your bins (of course you'd need to add a column name or some additional information if you will be working with a DataFrame; a sketch of that case follows the bins output below):

import numpy as np
import pandas as pd

data = pd.Series(np.arange(1, 11.0))

# Grow the edges geometrically from the minimum until the maximum is covered
bins = []
i = min(data)
while i < max(data):
    bins.append(i)
    i = i * 1.2
    bins.append(i)
bins = list(set(bins))  # drop the duplicated intermediate edges
bins.sort()
df = pd.cut(data, bins, include_lowest=True)
print(df)

Output:

0       (0.999, 1.2]
1     (1.728, 2.074]
2     (2.986, 3.583]
3       (3.583, 4.3]
4        (4.3, 5.16]
5      (5.16, 6.192]
6      (6.192, 7.43]
7      (7.43, 8.916]
8    (8.916, 10.699]
9    (8.916, 10.699]

Bins output:

Categories (13, interval[float64]): [(0.999, 1.2] < (1.2, 1.44] < (1.44, 1.728] < (1.728, 2.074] < ... <
                                     (5.16, 6.192] < (6.192, 7.43] < (7.43, 8.916] <
                                     (8.916, 10.699]]
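For the DataFrame case this answer alludes to, a minimal sketch might look like the following (the DataFrame and the column names "value" and "bin" are hypothetical, not from the original answer):

import numpy as np
import pandas as pd

# Hypothetical DataFrame; "value" is an illustrative column name
frame = pd.DataFrame({"value": np.arange(1, 11.0)})

# Same geometric-edge construction as above, applied to the column
bins = []
i = frame["value"].min()
while i < frame["value"].max():
    bins.append(i)
    i = i * 1.2
    bins.append(i)
bins = sorted(set(bins))

frame["bin"] = pd.cut(frame["value"], bins, include_lowest=True)
print(frame.head())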

Thanks everyone for all the suggestions. None does quite what I was after (probably because my original question wasn't clear enough), but they really helped me figure out what to do, so I have decided to post my own answer (I hope this is what I am supposed to do, as I am relatively new at being an active member of stackoverflow...)

I liked @yatu's vectorised suggestion best because it will scale better with large data sets, but I am after the means to not only automatically calculate the bins but also figure out the minimum number of bins needed to cover the data set.

This is my proposed algorithm:

  1. The bin size is defined so that bin_max_i / bin_min_i is constant:

     bin_max_i / bin_min_i = bin_ratio

  2. Figure out the number of bins for the required bin size (bin_ratio):

     data_ratio = data_max / data_min
     n_bins = math.ceil( math.log(data_ratio) / math.log(bin_ratio) )

  3. Set the lower boundary of the smallest bin so that the smallest data point fits in it:

     bin_min_0 = data_min

  4. Create n non-overlapping bins meeting the conditions:

     bin_min_i+1 = bin_max_i
     bin_max_i+1 = bin_min_i+1 * bin_ratio

  5. Stop creating further bins once the whole dataset can be split between the bins already created. In other words, stop once:

     bin_max_last > data_max

Here is a code snippet:

import math
import numpy as np
import pandas as pd

bin_ratio = 1.20

data = pd.Series(np.arange(2, 12))
data_ratio = max(data) / min(data)

n_bins = math.ceil( math.log(data_ratio) / math.log(bin_ratio) )
n_bins = n_bins + 1               # bin ranges are defined as [min, max)

bin_min_0 = min(data)             # lower limit of the 1st bin (step 3 above)
bins = np.full(n_bins, bin_ratio) # initialise the ratios for the bins limits
bins[0] = bin_min_0               # initialise the lower limit for the 1st bin
bins = np.cumprod(bins)           # generate bins

print(bins)
[ 2.          2.4         2.88        3.456       4.1472      4.97664
  5.971968    7.1663616   8.59963392 10.3195607  12.38347284]

I am now set to build a histogram of the data:

data.hist(bins=bins)
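As a quick sanity check (my addition, not part of the snippet above), the generated edges can also be fed to pd.cut to confirm that consecutive edges keep bin_ratio and that every value lands in a bin:

import math
import numpy as np
import pandas as pd

bin_ratio = 1.20
data = pd.Series(np.arange(2, 12))

# Rebuild the bins exactly as in the snippet above
n_bins = math.ceil(math.log(max(data) / min(data)) / math.log(bin_ratio)) + 1
bins = np.full(n_bins, bin_ratio)
bins[0] = min(data)
bins = np.cumprod(bins)

# Consecutive edges are in the constant ratio bin_ratio
assert np.allclose(bins[1:] / bins[:-1], bin_ratio)

# include_lowest=True keeps min(data) inside the first bin; no NaN means full coverage
binned = pd.cut(data, bins=bins, include_lowest=True)
print(binned.value_counts(sort=False))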
