Extract mean from categorical data type column

For example, I have this table:

import pandas as pd

# Build the sample DataFrame
list_1 = [['1', 'y', 474.0],
          ['2', 'n', 482.0],
          ['3', 'n', 564.0],
          ['4', 'y', 549.0],
          ['5', 'y', 551.0],
          ['6', 'y', 555.0],
          ['7', 'n', 600.0],
          ['8', 'y', 357.0],
          ['9', 'y', 542.0],
          ['10', 'n', 462.0],
          ['11', 'n', 513.0],
          ['12', 'y', 526.0]]
labels = ['id', 'order_?', 'hours_spend']
df = pd.DataFrame(list_1, columns=labels)
df

Result:

    id  order_? hours_spend
0   1   y        474.0
1   2   n        482.0
2   3   n        564.0
3   4   y        549.0
4   5   y        551.0
5   6   y        555.0
6   7   n        600.0
7   8   y        357.0
8   9   y        542.0
9   10  n        462.0
10  11  n        513.0
11  12  y        526.0

I'm going to divide the hours_spend column into 3 groups with the NTILE method, without labels:

# Split hours_spend into 3 quantile bins; without labels, each bin is shown as an interval
df['ntile'] = pd.qcut(df['hours_spend'], 3)
df

Result:

    id order_?  hours_spend               ntile
0    1       y        474.0  (356.999, 502.667]
1    2       n        482.0  (356.999, 502.667]
2    3       n        564.0    (549.667, 600.0]
3    4       y        549.0  (502.667, 549.667]
4    5       y        551.0    (549.667, 600.0]
5    6       y        555.0    (549.667, 600.0]
6    7       n        600.0    (549.667, 600.0]
7    8       y        357.0  (356.999, 502.667]
8    9       y        542.0  (502.667, 549.667]
9   10       n        462.0  (356.999, 502.667]
10  11       n        513.0  (502.667, 549.667]
11  12       y        526.0  (502.667, 549.667]
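
For comparison, here is a minimal sketch of what `pd.qcut` returns with and without `labels=False`. The names `hours`, `bins`, and `bucket` are made up for illustration, and the `+ 1` is only there to mimic SQL's 1-based NTILE numbering:

```python
import pandas as pd

hours = pd.Series([474.0, 482.0, 564.0, 549.0, 551.0, 555.0,
                   600.0, 357.0, 542.0, 462.0, 513.0, 526.0],
                  name='hours_spend')

# Default: every value is labelled with the interval (bin) it falls into,
# and the result has the 'category' dtype.
bins = pd.qcut(hours, 3)

# labels=False returns the integer bin position (0, 1, 2) instead.
# Adding 1 gives NTILE-style bucket numbers (illustrative choice).
bucket = pd.qcut(hours, 3, labels=False) + 1

print(bins.cat.categories)
print(bucket.tolist())
```

If only the bucket number is needed, `labels=False` avoids the interval objects entirely; the interval form is what makes the bounds available for later calculations.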

Now I have a column 'ntile' with data type 'category':

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 4 columns):
id             12 non-null object
order_?        12 non-null object
hours_spend    12 non-null float64
ntile          12 non-null category
dtypes: category(1), float64(1), object(2)
memory usage: 556.0+ bytes
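
As a rough sketch of what sits behind that `category` dtype: the categories of a `pd.qcut` result are stored as an `IntervalIndex`, which exposes the bounds and midpoints of every bin. The names `hours`, `ntile`, and `cats` below are illustrative only:

```python
import pandas as pd

hours = pd.Series([474.0, 482.0, 564.0, 549.0, 551.0, 555.0,
                   600.0, 357.0, 542.0, 462.0, 513.0, 526.0])
ntile = pd.qcut(hours, 3)

# One entry per bin, not per row
cats = ntile.cat.categories

print(type(cats).__name__)  # the container type of the categories
print(cats.left)            # lower bound of each bin
print(cats.right)           # upper bound of each bin
print(cats.mid)             # midpoint of each bin
```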

How can I add a new column with the mean of each range in the 'ntile' column?

You can simply define it as:

# Average each interval's bounds; astype(int) truncates them first (356.999 -> 356)
df['mean_ntile'] = (df['ntile'].apply(lambda x: x.left).astype(int)
                    + df['ntile'].apply(lambda x: x.right).astype(int)) / 2
print(df)

Output:

    id order_?  hours_spend               ntile  mean_ntile
0    1       y        474.0  (356.999, 502.667]       429.0
1    2       n        482.0  (356.999, 502.667]       429.0
2    3       n        564.0    (549.667, 600.0]       574.5
3    4       y        549.0  (502.667, 549.667]       525.5
4    5       y        551.0    (549.667, 600.0]       574.5
5    6       y        555.0    (549.667, 600.0]       574.5
6    7       n        600.0    (549.667, 600.0]       574.5
7    8       y        357.0  (356.999, 502.667]       429.0
8    9       y        542.0  (502.667, 549.667]       525.5
9   10       n        462.0  (356.999, 502.667]       429.0
10  11       n        513.0  (502.667, 549.667]       525.5
11  12       y        526.0  (502.667, 549.667]       525.5
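
Note that the values in `mean_ntile` above come from integer-truncated bounds, so they differ slightly from the true midpoints of the intervals. A small sketch of the arithmetic for the first bin, using a hand-built `pd.Interval` (the variable names are illustrative):

```python
import pandas as pd

# The first bin from the table, rebuilt by hand for illustration
iv = pd.Interval(356.999, 502.667, closed='right')

# What the astype(int) approach computes: (356 + 502) / 2
truncated = (int(iv.left) + int(iv.right)) / 2

# The midpoint of the untruncated bounds: (356.999 + 502.667) / 2
exact = (iv.left + iv.right) / 2

print(truncated)  # 429.0
print(exact)      # roughly 429.833
```

Whether the truncated or the exact value is wanted depends on the use case; dropping `.astype(int)` keeps the fractional part.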

As @ALlolz suggested, an easier way is:

df['mean_ntile'] = df['ntile'].apply(lambda x: x.mid)
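
A possible variation, offered as a sketch rather than as part of the original answer: wrapping the column in a `pd.IntervalIndex` gives the midpoints without a Python-level lambda, and the new column is plain `float64`. Depending on the pandas version, the `apply` form may instead return a categorical column, which is why this sketch takes the other route:

```python
import pandas as pd

df = pd.DataFrame({'hours_spend': [474.0, 482.0, 564.0, 549.0, 551.0, 555.0,
                                   600.0, 357.0, 542.0, 462.0, 513.0, 526.0]})
df['ntile'] = pd.qcut(df['hours_spend'], 3)

# Convert the categorical column of intervals to an IntervalIndex
# and read every row's midpoint in one vectorized step.
df['mean_ntile'] = pd.IntervalIndex(df['ntile']).mid

print(df.dtypes)
print(df)
```

Unlike the `astype(int)` version, this keeps the fractional part of the bounds, so the first bin's midpoint is about 429.833 instead of 429.0.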
