Binning Pandas column values by standard deviation centered on average?
I have a Pandas data frame with a bunch of values in sorted order:
df = pd.DataFrame(np.arange(1,21))
I want to end up with a list/array like this:
[0,1.62,4.58,7.54,10.5,13.45,16.4,19.37,20]
The first and last elements are df.min() and df.max(), the center element is df.mean(), and the surrounding elements are spaced in increments of 0.5*df.std().
Is there a way to vectorize this for large DataFrames?
UPDATE (an efficient method is in the answers below!)
a = np.arange(df[0].mean(),df[0].min(),-0.5*df[0].std())
b = np.arange(df[0].mean(),df[0].max(),0.5*df[0].std())
c = np.concatenate((a,b))
c = np.append(c,[df[0].min(),df[0].max()])
c = np.unique(c)
And then use np.digitize() to move values to appropriate bins.
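For illustration, here is a minimal sketch of that np.digitize() step applied to the edges built above. The names col, step, below, above and bin_ids are my own, not from the original post, and the default right=False behaviour is assumed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 21))
col = df[0]
mu, sigma = col.mean(), col.std()
step = 0.5 * sigma

# Edges: the mean plus/minus multiples of 0.5*std, bounded by min and max
below = np.arange(mu, col.min(), -step)
above = np.arange(mu, col.max(), step)
edges = np.unique(np.concatenate((below, above, [col.min(), col.max()])))

# With the default right=False, a value x gets index i when
# edges[i-1] <= x < edges[i]; values at or above the last edge get len(edges).
bin_ids = np.digitize(col.to_numpy(), edges)

print(edges.round(2))
print(bin_ids)
```

Note that with these defaults the maximum value itself lands in index len(edges), one past the last interior bin, so you may want to handle that case separately.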
If you find a more efficient way though, that would be helpful!
mu_sig calculates various multiples of the standard deviation by multiplying [-2, -1, 0, 1, 2] by sigma.
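The body of mu_sig is not shown here, so the following is only a guess at what such a helper might look like. I'm assuming it adds the mean to each multiple of sigma; the multiples parameter is hypothetical.

```python
import numpy as np
import pandas as pd

def mu_sig(s, multiples=(-2, -1, 0, 1, 2)):
    """Hypothetical helper: mean of `s` plus each multiple of its std."""
    # Assumption: the multiples of sigma are centered on the mean
    return s.mean() + np.asarray(multiples) * s.std()

s = pd.Series(np.arange(1, 21))
print(mu_sig(s).round(2))
```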
edges takes a series and gets the mu_sig results. It then checks whether the series minimum is less than the smallest edge (the mean plus the smallest multiple of the standard deviation). If it is, the minimum is prepended to the list. The same check is done for the maximum.
import numpy as np

def edges(s, n=7, rnd=2, sig_mult=1):
    mu = s.mean()
    sig = s.std()
    mn = s.min()
    mx = s.max()
    # Offsets from the mean: integer multiples of sig * sig_mult
    offsets = np.arange(-n // 2, (n + 1) // 2 + 1) * sig * sig_mult
    ms = mu + offsets
    # Add the series min/max only if they fall outside the sigma-based edges
    if mn < ms.min():
        ms = np.concatenate([[mn], ms])
    if mx > ms.max():
        ms = np.concatenate([ms, [mx]])
    return ms.round(rnd).tolist()
It works on a series, so I'll squeeze your dataframe:
df = pd.DataFrame(np.arange(1,21))
s = df.squeeze()
Then use edges:
edges(s, sig_mult=.5, n=5)
[1, 1.63, 4.58, 7.54, 10.5, 13.46, 16.42, 19.37, 20]
edges(s)
[-13.16, -7.25, -1.33, 4.58, 10.5, 16.42, 22.33, 28.25, 34.16]
With the default arguments this returns a list of length 9; the series minimum and maximum are added only when they fall outside the range of the computed edges. You can pass n to get lists of different lengths.
edges(s, n=3)
[-1.33, 4.58, 10.5, 16.42, 22.33]
Anticipating that you may want to change this to different multiples of standard deviation, you can also do:
edges(s, n=3, sig_mult=.2)
[1, 8.13, 9.32, 10.5, 11.68, 12.87, 20]
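Once you have a list of edges, one way to do the actual binning is pd.cut. The sketch below is not part of the answer above; it simply reuses the edge list printed earlier for sig_mult=.5, n=5 as a hard-coded input, and the names bin_edges and codes are my own.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1, 21))

# Edge list as printed above for edges(s, sig_mult=.5, n=5)
bin_edges = [1, 1.63, 4.58, 7.54, 10.5, 13.46, 16.42, 19.37, 20]

# Intervals are right-closed, (a, b]; include_lowest=True keeps the
# minimum value inside the first bin instead of turning it into NaN.
codes = pd.cut(s, bins=bin_edges, labels=False, include_lowest=True)

# Number of values that fall in each bin
print(codes.value_counts().sort_index())
```

Passing labels=False returns integer bin codes; leaving it out returns the Interval labels instead, which can be easier to read when inspecting the result.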
Series of length 20
Series of length 1,000,000