Binning Pandas column values by standard deviation centered on average?

I have a Pandas data frame with a bunch of values in sorted order:

df = pd.DataFrame(np.arange(1,21))

I want to end up with a list/array like this:

[1,1.62,4.58,7.54,10.5,13.45,16.4,19.37,20]

The first and last elements are df.min() and df.max(), the center element is the df.mean() of the dataframe, and the surrounding elements step away from the mean in increments of 0.5*df.std().

Is there a way to vectorize this for large DataFrames?

UPDATE (Efficient method is in the answers below!)

a = np.arange(df[0].mean(),df[0].min(),-0.5*df[0].std())
b = np.arange(df[0].mean(),df[0].max(),0.5*df[0].std())
c = np.concatenate((a,b))
c = np.append(c,[df[0].min(),df[0].max()])
c = np.unique(c)

And then use np.digitize() to move values to appropriate bins.

If you find a more efficient way though, that would be helpful!
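For reference, the np.digitize() step mentioned above could look roughly like the sketch below. The names half_sd, bin_edges and bin_ids are made up for illustration, and clipping the maximum back into the last bin is an assumption about the desired behavior, not something stated in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 21))
col = df[0]
half_sd = 0.5 * col.std()

# Edges stepping away from the mean in half-standard-deviation increments,
# plus the column minimum and maximum (same construction as in the update).
below = np.arange(col.mean(), col.min(), -half_sd)
above = np.arange(col.mean(), col.max(), half_sd)
bin_edges = np.unique(np.concatenate((below, above, [col.min(), col.max()])))

# np.digitize returns, for each value, the index i such that
# bin_edges[i-1] <= value < bin_edges[i]. The maximum equals the last edge,
# so it would land one past the last bin; clip it back (an assumption).
bin_ids = np.digitize(col.to_numpy(), bin_edges)
bin_ids = np.clip(bin_ids, 1, len(bin_edges) - 1)

print(bin_ids)
```

With nine edges this gives eight bins, numbered 1 through 8.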

mu_sig calculates various multiples of the standard deviation by multiplying a range of integers such as [-2, -1, 0, 1, 2] by sigma. In the code below this step is done inline inside edges.

edges takes a series and computes those multiples around the mean. It then checks whether the series minimum is less than the smallest edge (the mean plus the most negative multiple of the standard deviation). If it is, the minimum is prepended to the list. The same check is done for the maximum.

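import numpy as np
import pandas as pd


def edges(s, n=7, rnd=2, sig_mult=1):
    """Return bin edges: the mean plus multiples of sig_mult * std."""
    mu = s.mean()
    sig = s.std()
    mn = s.min()
    mx = s.max()

    # Offsets from the mean: integer multiples of the scaled standard deviation
    offsets = np.arange(-n // 2, (n + 1) // 2 + 1) * sig * sig_mult
    ms = mu + offsets

    # Add the series min/max only when they fall outside the sigma-based edges
    if mn < ms.min():
        ms = np.concatenate([[mn], ms])
    if mx > ms.max():
        ms = np.concatenate([ms, [mx]])

    return ms.round(rnd).tolist()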

It works on a series, so I'll squeeze your dataframe:

df = pd.DataFrame(np.arange(1,21))
s = df.squeeze()

Then use edges

THIS IS YOUR ANSWER

edges(s, sig_mult=.5, n=5)

[1, 1.63, 4.58, 7.54, 10.5, 13.46, 16.42, 19.37, 20]

edges(s)

[-13.16, -7.25, -1.33, 4.58, 10.5, 16.42, 22.33, 28.25, 34.16]

For this series the default call returns a list of length 9. The minimum (1) and maximum (20) already lie inside the sigma-based edges, so they are not added. You can pass n to get lists of different lengths.

edges(s, n=3)

[-1.33, 4.58, 10.5, 16.42, 22.33]

Anticipating that you may want to change this to different multiples of standard deviation, you can also do:

edges(s, n=3, sig_mult=.2)

[1, 8.13, 9.32, 10.5, 11.68, 12.87, 20]
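Once you have an edge list, one way to do the actual binning is pd.cut. This is only a sketch: the edge list is copied from the edges(s, sig_mult=.5, n=5) result shown earlier, and the names bin_edges, codes and counts are just for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 21))

# Edge list copied from the edges(s, sig_mult=.5, n=5) result shown earlier.
bin_edges = [1, 1.63, 4.58, 7.54, 10.5, 13.46, 16.42, 19.37, 20]

# labels=False returns integer bin codes; include_lowest=True keeps the
# minimum (which equals the first edge) inside the first bin.
codes = pd.cut(s, bins=bin_edges, labels=False, include_lowest=True)

# Number of values that fall into each bin, ordered by bin code.
counts = codes.value_counts().sort_index()
print(counts)
```

Note that pd.cut expects the edges to be monotonically increasing, and values outside the outermost edges come back as NaN.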

Timing

Series of length 20

[Figure: timing results for a series of length 20]

Series of length 1,000,000

[Figure: timing results for a series of length 1,000,000]
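If you want to measure this yourself, a comparison along these lines could be set up with the standard timeit module. This is a sketch under assumptions: edges_arange and edges_multiples are hypothetical names for the question's np.arange approach and the answer's approach (without rounding), and the repeat counts are arbitrary. No results are claimed here, since they depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd


def edges_arange(s, step_mult=0.5):
    """Question's approach: step outward from the mean with np.arange."""
    mu, sd, mn, mx = s.mean(), s.std(), s.min(), s.max()
    below = np.arange(mu, mn, -step_mult * sd)
    above = np.arange(mu, mx, step_mult * sd)
    return np.unique(np.concatenate((below, above, [mn, mx])))


def edges_multiples(s, n=5, sig_mult=0.5):
    """Answer's approach: mean plus integer multiples of the scaled std."""
    mu, sd, mn, mx = s.mean(), s.std(), s.min(), s.max()
    ms = mu + np.arange(-n // 2, (n + 1) // 2 + 1) * sd * sig_mult
    if mn < ms.min():
        ms = np.concatenate([[mn], ms])
    if mx > ms.max():
        ms = np.concatenate([ms, [mx]])
    return ms


timings = {}
for size in (20, 1_000_000):  # the two series lengths named above
    s = pd.Series(np.arange(1, size + 1))
    for fn in (edges_arange, edges_multiples):
        # Best of 3 repeats of 10 calls each, in seconds.
        best = min(timeit.repeat(lambda: fn(s), number=10, repeat=3))
        timings[(fn.__name__, size)] = best
        print(f"{fn.__name__:>16} n={size:>9}: {best:.6f} s")
```

Both functions spend most of their time in mean, std, min and max on large series, so any difference between them is likely to matter mainly for small inputs.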
