
Pandas.cut VS df.describe()


I would like to group the data into 4 ranges, so I used pd.cut to bin it. Here is my code and the result:

[Image: pd.cut code and output]

Then I ran df.describe() and found that the range edges it reports differ from the bin edges produced by pd.cut. Why?

[Image: df.describe() output]

pd.cut gives: [(2.719, 3.042] < (3.042, 3.365] < (3.365, 3.688] < (3.688, 4.01]]

df.describe() gives:

min         2.720000
25%         3.110000
50%         3.210000
75%         3.320000
max         4.010000

Your cut divides the range between min and max into 4 equal-width bins, whereas describe reports quartiles, i.e. the values below which 25%, 50% and 75% of the observations fall. Only for uniformly distributed data would the two give (approximately) the same subdivisions.
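To see where the pd.cut edges in the question come from, the sketch below recomputes them from the min and max shown in the describe() output. It assumes the default pd.cut behaviour (right-closed bins, with the lowest edge moved slightly below the minimum so that the minimum falls inside the first bin). The five-value series is a hypothetical stand-in built from the summary statistics, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# min and max taken from the describe() output in the question
lo, hi = 2.72, 4.01

# 4 equal-width bins need 5 edges; each bin is (hi - lo) / 4 wide
edges = np.linspace(lo, hi, 5)
print(edges)

# Hypothetical stand-in series: only the five summary statistics
s = pd.Series([2.72, 3.11, 3.21, 3.32, 4.01])
cats = pd.cut(s, 4).cat.categories
print(cats)
```

The interior edges depend only on min and max, while the 25%, 50% and 75% values from describe depend on where the observations are concentrated.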

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'uniform': np.random.rand(1_000_000), 'normal': np.random.randn(1_000_000)})

with np.printoptions(formatter={'float': '{:.3f}'.format}):
    print( 'uniform:\n'
           f'   {df.uniform.describe().iloc[3:].values}\n'
           f'   {pd.cut(df.uniform, 4).dtype.categories.to_tuples().to_list()}')
    print( 'normal:\n'
           f'   {df.normal.describe().iloc[3:].values}\n'
           f'   {pd.cut(df.normal, 4).dtype.categories.to_tuples().to_list()}')

Output:

uniform:
   [0.000 0.250 0.499 0.750 1.000]
   [(-0.001, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
normal:
   [-4.908 -0.675 0.001 0.674 5.082]
   [(-4.918, -2.411), (-2.411, 0.0867), (0.0867, 2.584), (2.584, 5.082)]
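If bins that line up with the describe() quartiles are what you want, pd.qcut is the quantile-based counterpart of pd.cut. A minimal sketch, assuming synthetic normal data and an arbitrary seed (neither comes from the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
s = pd.Series(rng.normal(size=100_000))

# Quantile-based bins: edges are placed at the 25%, 50% and 75% quantiles
quartile_bins = pd.qcut(s, 4)
qcut_edges = [iv.right for iv in quartile_bins.cat.categories][:3]
quartiles = s.describe()[['25%', '50%', '75%']].tolist()
print(qcut_edges)
print(quartiles)

# qcut puts about the same number of rows in every bin,
# cut keeps the bin widths equal and lets the counts vary
qcut_counts = quartile_bins.value_counts().sort_index()
cut_counts = pd.cut(s, 4).value_counts().sort_index()
print(qcut_counts)
print(cut_counts)
```

The qcut edges agree with the describe() quartiles up to the rounding applied to the interval labels, and the bin counts are nearly equal, whereas the equal-width bins from cut hold very different numbers of rows for non-uniform data.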
