Python Pandas Count of unique column values based on another column
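I have a table (example below):
    kick_result  kick_yards kicker
50         MADE        28.0     X1
64         MADE        30.0     X2
75         MADE        27.0     X2
158        MADE        32.0     X2
259        MISS        46.0     X3
For each value of kicker, I want to compute the percentage of field goals made and missed.
For each value of kicker, I also want the number of field goals made and missed in each yardage range: <20, 21-30, 31-40, 41-50, 51+.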
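Let's use pd.cut and pd.crosstab:
import numpy as np
import pandas as pd
# Bin kick_yards, then cross-tabulate (kicker, bin) against kick_result as row fractions
out = pd.crosstab([df.kicker, pd.cut(df.kick_yards, [20, 30, 40, 50, np.inf], include_lowest=True)],
                  df.kick_result, normalize='index')
out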
Out[228]:
kick_result             MADE  MISS
kicker kick_yards
X1     (19.999, 30.0]    1.0   0.0
X2     (19.999, 30.0]    1.0   0.0
       (30.0, 40.0]      1.0   0.0
X3     (40.0, 50.0]      0.0   1.0
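To see what the binning step does on its own, here is a minimal sketch that applies the same bin edges to the distances from the question's sample table. The variable names (`yards`, `bins`, `short`) are illustrative only and do not come from the answer.

```python
import numpy as np
import pandas as pd

# Distances copied from the question's sample table.
yards = pd.Series([28.0, 30.0, 27.0, 32.0, 46.0], name="kick_yards")

# Edges 20, 30, 40, 50, +inf define four right-closed intervals.
# include_lowest=True makes the first interval contain 20 itself,
# which is why pandas displays it as (19.999, 30.0].
edges = [20, 30, 40, 50, np.inf]
bins = pd.cut(yards, edges, include_lowest=True)

# A kick shorter than the lowest edge falls outside every interval
# and is returned as NaN.
short = pd.cut(pd.Series([18.0]), edges, include_lowest=True)

print(bins.astype(str).tolist())
print(short.isna().tolist())
```

Because the lowest edge here is 20, a kick under 20 yards gets no bin at all. If the "<20" range from the question matters, the edges presumably need to start at 0 instead.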
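Your requirement has two parts, so let's tackle them one at a time.
For the first part (the percentage of made and missed kicks per kicker), we can use df.groupby() and .value_counts(normalize=True):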
(df.groupby('kicker')['kick_result']
.value_counts(normalize=True).mul(100).round(2)
.sort_index()
.to_frame(name='Result_%')
).reset_index()
Test data:
To test both requirements more thoroughly, I extended the sample data as follows:
    kick_result  kick_yards kicker
49         MADE        18.0     X1
50         MADE        28.0     X1
51         MADE        38.0     X1
52         MISS        48.0     X1
53         MISS        58.0     X1
64         MADE        30.0     X2
75         MADE        27.0     X2
158        MADE        32.0     X2
159        MISS        32.0     X2
160        MISS        42.0     X2
259        MISS        46.0     X3
260        MISS        26.0     X3
261        MADE        56.0     X3
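To make the snippets reproducible, the table above can be rebuilt as a DataFrame. This is only one way to do it; the values and index labels are copied from the table, and the construction itself is not part of the original answer.

```python
import pandas as pd

# Extended test data, copied row by row from the table above.
df = pd.DataFrame(
    {
        "kick_result": ["MADE", "MADE", "MADE", "MISS", "MISS",
                        "MADE", "MADE", "MADE", "MISS", "MISS",
                        "MISS", "MISS", "MADE"],
        "kick_yards": [18.0, 28.0, 38.0, 48.0, 58.0,
                       30.0, 27.0, 32.0, 32.0, 42.0,
                       46.0, 26.0, 56.0],
        "kicker": ["X1"] * 5 + ["X2"] * 5 + ["X3"] * 3,
    },
    # Keep the original row labels so the printout matches the table.
    index=[49, 50, 51, 52, 53, 64, 75, 158, 159, 160, 259, 260, 261],
)

print(df)
```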
Running the code:
(df.groupby('kicker')['kick_result']
.value_counts(normalize=True).mul(100).round(2)
.sort_index()
.to_frame(name='Result_%')
).reset_index()
Result:
  kicker kick_result  Result_%
0     X1        MADE     60.00
1     X1        MISS     40.00
2     X2        MADE     60.00
3     X2        MISS     40.00
4     X3        MADE     33.33
5     X3        MISS     66.67
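As a possible alternative for the same percentages, pd.crosstab with normalize="index" gives one row per kicker and one column per result. This is a sketch rather than the answer's own method; the `pct` name is made up for illustration, and the data repeats the kicker and kick_result columns of the extended test set.

```python
import pandas as pd

# kicker and kick_result columns of the extended test data.
df = pd.DataFrame({
    "kicker": ["X1"] * 5 + ["X2"] * 5 + ["X3"] * 3,
    "kick_result": ["MADE", "MADE", "MADE", "MISS", "MISS",
                    "MADE", "MADE", "MADE", "MISS", "MISS",
                    "MISS", "MISS", "MADE"],
})

# normalize="index" divides each row by its row total, so each kicker's
# MADE and MISS shares sum to 1; multiply by 100 to get percentages.
pct = (pd.crosstab(df["kicker"], df["kick_result"], normalize="index")
         .mul(100)
         .round(2))

print(pct)
```

One difference in shape: a kicker who never missed would still get a MISS column with 0.0 here, whereas the groupby/value_counts version would simply have no MISS row for that kicker.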
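For the second part (made and missed counts per yardage range), we can use pd.crosstab() and pd.cut() to build a table broken down by yardage range.
It also includes the total number of attempts across all ranges.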
pd.crosstab(index=[df['kicker'], pd.cut(df['kick_yards'],[0, 20, 30, 40, 50, np.inf])],
columns=df['kick_result'],
margins=True, margins_name='Total_Attempts')
Result (using the extended test data):
kick_result                  MADE  MISS  Total_Attempts
kicker         kick_yards
X1             (0.0, 20.0]      1     0               1
               (20.0, 30.0]     1     0               1
               (30.0, 40.0]     1     0               1
               (40.0, 50.0]     0     1               1
               (50.0, inf]      0     1               1
X2             (20.0, 30.0]     2     0               2
               (30.0, 40.0]     1     1               2
               (40.0, 50.0]     0     1               1
X3             (20.0, 30.0]     0     1               1
               (40.0, 50.0]     0     1               1
               (50.0, inf]      1     0               1
Total_Attempts                  7     6              13
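If readable range names are preferred over interval notation, pd.cut accepts a labels argument. The sketch below assumes the label strings shown (they are my own choice, not from the answer) and also derives a made rate per row. Since the bins are right-closed, the first label covers kicks of up to and including 20 yards.

```python
import numpy as np
import pandas as pd

# Extended test data from the answer.
df = pd.DataFrame({
    "kick_result": ["MADE", "MADE", "MADE", "MISS", "MISS",
                    "MADE", "MADE", "MADE", "MISS", "MISS",
                    "MISS", "MISS", "MADE"],
    "kick_yards": [18.0, 28.0, 38.0, 48.0, 58.0,
                   30.0, 27.0, 32.0, 32.0, 42.0,
                   46.0, 26.0, 56.0],
    "kicker": ["X1"] * 5 + ["X2"] * 5 + ["X3"] * 3,
})

# Hypothetical label names, one per interval between consecutive edges.
range_labels = ["0-20", "21-30", "31-40", "41-50", "51+"]
yard_range = pd.cut(df["kick_yards"], [0, 20, 30, 40, 50, np.inf],
                    labels=range_labels)

# Made/missed counts per (kicker, yardage range).
counts = pd.crosstab([df["kicker"], yard_range], df["kick_result"])

# Share of attempts in each row that were made.
made_rate = counts["MADE"] / counts.sum(axis=1)

print(counts.assign(made_rate=made_rate.round(2)))
```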
Using get_dummies and cut to build the resulting DataFrame:
df['Att'] = 1  # one attempt per row
dfmm = pd.get_dummies(df['kick_result'], dtype=int)  # MADE / MISS indicator columns
cols_A = ['A20','A21-30','A31-40','A41-50','A51+']  # attempts per yardage range
cols_M = [x.replace('A','M') for x in cols_A]  # makes per yardage range
df_att = pd.get_dummies(pd.cut(df.kick_yards,[0,20,30,40,50,np.inf],include_lowest=True), dtype=int)
df_att.columns = cols_A
df_made = df_att.multiply(dfmm['MADE'], axis=0)  # an attempt counts as a make only if MADE == 1
df_made.columns = cols_M
dff = pd.concat([df,dfmm,df_att,df_made], axis=1).drop(['kick_result','kick_yards'], axis=1)
Resulting DataFrame:
  kicker  Att  MADE  MISS  A20  A21-30  A31-40  A41-50  A51+  M20  M21-30  \
0     X1    1     1     0    0       1       0       0     0    0       1
1     X2    1     1     0    0       1       0       0     0    0       1
2     X2    1     1     0    0       1       0       0     0    0       1
3     X2    1     1     0    0       0       1       0     0    0       0
4     X3    1     0     1    0       0       0       1     0    0       0

   M31-40  M41-50  M51+
0       0       0     0
1       0       0     0
2       0       0     0
3       1       0     0
4       0       0     0
Aggregating that DataFrame:
dff.groupby('kicker').agg(['sum'])
       Att MADE MISS A20 A21-30 A31-40 A41-50 A51+ M20 M21-30 M31-40 M41-50  \
       sum  sum  sum sum    sum    sum    sum  sum sum    sum    sum    sum
kicker
X1       1    1    0   0      1      0      0    0   0      1      0      0
X2       3    3    0   0      2      1      0    0   0      2      1      0
X3       1    0    1   0      0      0      1    0   0      0      0      0

       M51+
        sum
kicker
X1        0
X2        0
X3        0
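The two-level column header above comes from passing a list to agg. A more compact variant of the same idea, sketched here with the question's five sample rows, lets get_dummies build the prefixed column names and uses a plain .sum() so the columns stay single-level. The names `yard_range`, `att`, `made` and `summary` are illustrative and not from the answer.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "kick_result": ["MADE", "MADE", "MADE", "MADE", "MISS"],
    "kick_yards": [28.0, 30.0, 27.0, 32.0, 46.0],
    "kicker": ["X1", "X2", "X2", "X2", "X3"],
})

# Label each kick with its yardage range.
labels = ["20", "21-30", "31-40", "41-50", "51+"]
yard_range = pd.cut(df["kick_yards"], [0, 20, 30, 40, 50, np.inf],
                    labels=labels, include_lowest=True)

# One indicator column per range: A20, A21-30, ... (attempts).
att = pd.get_dummies(yard_range, prefix="A", prefix_sep="", dtype=int)

# An attempt counts as a make only when kick_result is MADE.
is_made = (df["kick_result"] == "MADE").astype(int)
made = att.multiply(is_made, axis=0)
made.columns = [c.replace("A", "M", 1) for c in att.columns]

# Plain .sum() keeps a single header row, unlike .agg(['sum']).
summary = pd.concat([df["kicker"], att, made], axis=1).groupby("kicker").sum()

print(summary)
```

Because the yardage ranges are a categorical, get_dummies should emit a column for every range, including ranges with no attempts, so the column layout stays stable across data sets.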