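How to get the top 5 percentile values in a pandas Series for each class?
I am working on a practice problem in which I want to get the top 5 percentile of fraud scores for each state. I was able to solve it in SQL, but pandas gives me a different answer from SQL.
Full question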
Top Percentile Fraud
ABC Corp is a mid-sized insurer in the US
and in the recent past their fraudulent claims have increased significantly for their personal auto insurance portfolio.
They have developed a ML based predictive model to identify
propensity of fraudulent claims.
Now, they assign highly experienced claim adjusters for top 5 percentile of claims identified by the model.
Your objective is to identify the top 5 percentile of claims from each state.
Your output should be policy number, state, claim cost, and fraud score.
import numpy as np
import pandas as pd
url = "https://raw.githubusercontent.com/bpPrg/Share/master/data/fraud_score.tsv"
df = pd.read_csv(url,delimiter='\t')
print(df.shape) # (400, 4)
df.head(2)
policy_num state claim_cost fraud_score
0 ABCD1001 CA 4113 0.613
1 ABCD1002 CA 3946 0.156
df['state_ntile'] = df.groupby('state')['fraud_score']\
    .transform(lambda ser: pd.cut(ser, 100).cat.codes + 1)  # +1 makes the codes run from 1 to 100 inclusive
df.query('state_ntile >= 95')\
    .sort_values(['state', 'fraud_score'], ascending=[True, False]).reset_index(drop=True)
SELECT policy_num,
state,
claim_cost,
fraud_score,
a.percentile
FROM
(SELECT *,
ntile(100) over(PARTITION BY state
ORDER BY fraud_score DESC) AS percentile
FROM fraud_score)a
WHERE percentile <=5
policy_num state claim_cost fraud_score percentile
0 ABCD1027 CA 2663 0.988 1
1 ABCD1016 CA 1639 0.964 2
2 ABCD1079 CA 4224 0.963 3
3 ABCD1081 CA 1080 0.951 4
4 ABCD1069 CA 1426 0.948 5
5 ABCD1222 FL 2392 0.988 1
6 ABCD1218 FL 1419 0.961 2
7 ABCD1291 FL 2581 0.939 3
8 ABCD1230 FL 2560 0.923 4
9 ABCD1277 FL 2057 0.923 5
10 ABCD1189 NY 3577 0.982 1
11 ABCD1117 NY 4903 0.978 2
12 ABCD1187 NY 3722 0.976 3
13 ABCD1196 NY 2994 0.973 4
14 ABCD1121 NY 4009 0.969 5
15 ABCD1361 TX 4950 0.999 1
16 ABCD1304 TX 1407 0.996 1
17 ABCD1398 TX 3191 0.978 2
18 ABCD1366 TX 2453 0.968 3
19 ABCD1386 TX 4311 0.963 4
20 ABCD1363 TX 4103 0.960 5
You can use rank() to get the percentiles:
out = df.assign(
    # astype(int) truncates toward zero, e.g. 4.95 -> 4
    percentile=(100 * df.groupby('state')['fraud_score']
                .rank(ascending=False, pct=True, method='first'))
    .astype(int)
).query('percentile <= 5')
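The rows come out in a different order from the SQL output (they keep the order of the original df), but the result contains the information you are looking for: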
>>> out
policy_num state claim_cost fraud_score percentile
15 ABCD1016 CA 1639 0.964 2
26 ABCD1027 CA 2663 0.988 1
68 ABCD1069 CA 1426 0.948 5
78 ABCD1079 CA 4224 0.963 3
80 ABCD1081 CA 1080 0.951 4
116 ABCD1117 NY 4903 0.978 2
120 ABCD1121 NY 4009 0.969 5
186 ABCD1187 NY 3722 0.976 3
188 ABCD1189 NY 3577 0.982 1
195 ABCD1196 NY 2994 0.973 4
217 ABCD1218 FL 1419 0.961 2
221 ABCD1222 FL 2392 0.988 1
229 ABCD1230 FL 2560 0.923 4
276 ABCD1277 FL 2057 0.923 5
290 ABCD1291 FL 2581 0.939 3
303 ABCD1304 TX 1407 0.996 1
360 ABCD1361 TX 4950 0.999 0
362 ABCD1363 TX 4103 0.960 5
365 ABCD1366 TX 2453 0.968 3
385 ABCD1386 TX 4311 0.963 4
397 ABCD1398 TX 3191 0.978 2
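As a side note on why the pd.cut attempt in the question behaves differently: pd.cut splits the range of values into equal-width intervals, whereas rank() and ntile() work on the ordering of the rows. The sketch below uses made-up scores (not the fraud_score data) to show the two side by side; the names scores, width_bin and rank_pct are only for illustration.

```python
import pandas as pd

# Toy data (made up for illustration): nine low scores and one very high one.
scores = pd.Series([0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.98])

# pd.cut splits the value range [min, max] into equal-width intervals,
# so the bin a row lands in depends on its value, not on its rank.
width_bin = pd.cut(scores, 4).cat.codes + 1

# rank(pct=True) depends only on the ordering of the rows.
rank_pct = scores.rank(pct=True)

print(pd.DataFrame({"score": scores, "width_bin": width_bin, "rank_pct": rank_pct}))
```

Here the nine low scores all share the first bin and the middle bins stay empty, while rank_pct spreads the rows evenly. A filter such as state_ntile >= 95 on pd.cut codes therefore selects scores in the top 5% of the value range, which is not necessarily the top 5% of rows.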
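After more than a decade of using PostgreSQL (and the late, great Greenplum), I have grown increasingly fond of duckdb. It is very fast, can read from and write to Parquet files directly, and more. Definitely a space to watch.
Here is how it looks on your data: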
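import duckdb

# Register df under the name 'df' for the query and fetch the result as a DataFrame.
duck_out = duckdb.query_df(df, 'df', """
    SELECT policy_num,
           state,
           claim_cost,
           fraud_score,
           a.percentile
    FROM
      (SELECT *,
              ntile(100) over(PARTITION BY state
                              ORDER BY fraud_score DESC) AS percentile
       FROM df) as a
    WHERE percentile <= 5
""").df()
duck_out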
Result:
policy_num state claim_cost fraud_score percentile
0 ABCD1222 FL 2392 0.988 1
1 ABCD1218 FL 1419 0.961 2
2 ABCD1291 FL 2581 0.939 3
3 ABCD1230 FL 2560 0.923 4
4 ABCD1277 FL 2057 0.923 5
5 ABCD1361 TX 4950 0.999 1
6 ABCD1304 TX 1407 0.996 1
7 ABCD1398 TX 3191 0.978 2
8 ABCD1366 TX 2453 0.968 3
9 ABCD1386 TX 4311 0.963 4
10 ABCD1363 TX 4103 0.960 5
11 ABCD1027 CA 2663 0.988 1
12 ABCD1016 CA 1639 0.964 2
13 ABCD1079 CA 4224 0.963 3
14 ABCD1081 CA 1080 0.951 4
15 ABCD1069 CA 1426 0.948 5
16 ABCD1189 NY 3577 0.982 1
17 ABCD1117 NY 4903 0.978 2
18 ABCD1187 NY 3722 0.976 3
19 ABCD1196 NY 2994 0.973 4
20 ABCD1121 NY 4009 0.969 5
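A keen eye will spot a subtle difference between the two results above (apart from the ordering). It comes from the different definition of percentile used by rank(pct=True), compared with ntile(100).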
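For comparison, the ntile(100) side can be emulated in pandas. The following is only a sketch: ntile below is a hypothetical helper (pandas has no built-in with that name), the data is made up, and it assumes the usual SQL behaviour in which the leftover rows go to the first buckets. It also assumes there are no missing scores.

```python
import numpy as np
import pandas as pd

def ntile(ser, n, ascending=False):
    """Hypothetical helper: SQL-style NTILE(n) bucket (1..n) for each row of ser.

    Assumes ser has no missing values.
    """
    # 0-based row number within the group; ties are broken by order of appearance.
    row = ser.rank(method="first", ascending=ascending).astype(int) - 1
    q, r = divmod(len(ser), n)
    # The first r buckets get q + 1 rows each, the remaining buckets get q rows.
    cutoff = r * (q + 1)
    bucket = np.where(
        row < cutoff,
        row // (q + 1),
        r + (row - cutoff) // max(q, 1),
    )
    return pd.Series(bucket + 1, index=ser.index)

# Made-up toy data: group "A" has 101 rows, group "B" has 100.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "state": ["A"] * 101 + ["B"] * 100,
    "fraud_score": rng.permutation(201) / 1000,
})
toy["percentile"] = toy.groupby("state")["fraud_score"].transform(lambda s: ntile(s, 100))
top = toy.query("percentile <= 5")
print(top.groupby("state").size())
```

With 101 rows and 100 buckets, the first bucket holds two rows, which is the same pattern as the two TX rows with percentile 1 in the SQL output. Ties are ordered by appearance here, whereas SQL leaves the order of tied rows unspecified.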
Here is how to see the differences:
a = out.set_index('policy_num').sort_index()
b = duck_out.set_index('policy_num').sort_index()
Then:
>>> a.equals(b)
False
>>> a[(a != b).any(axis=1)]
state claim_cost fraud_score percentile
policy_num
ABCD1361 TX 4950 0.999 0
>>> b[(a != b).any(axis=1)]
state claim_cost fraud_score percentile
policy_num
ABCD1361 TX 4950 0.999 1
If we look at the value of percentile before truncation:
>>> s = (a != b).any(axis=1)
>>> df.assign(
... percentile=(100 * df.groupby('state')['fraud_score'].rank(
... ascending=False, pct=True, method='first'))
... ).set_index('policy_num').loc[s[s].index]
state claim_cost fraud_score percentile
policy_num
ABCD1361 TX 4950 0.999 0.990099
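The value 0.990099 is what 100 * 1/101 gives, which suggests that this group has 101 rows (an inference from the number, not something stated in the thread). A small sketch with made-up scores reproduces the arithmetic; scores, pct and best are illustrative names.

```python
import pandas as pd

# Made-up scores: 101 distinct values, so the best one has rank 1 of 101.
scores = pd.Series(range(101), dtype=float)
pct = 100 * scores.rank(ascending=False, pct=True, method="first")

best = pct.min()   # percentile of the highest score, i.e. 100 * 1/101
print(best)
print(int(best))   # converting to int truncates toward zero
```

Because the top row sits just below 1.0, converting to int drops it to 0, while ntile(100) numbers its buckets from 1.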
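Thanks to Emma, I got a partial solution. I could not get ranks like 1, 2, 3, ..., 100, but the resulting table is at least the same as the SQL output. I am still learning how to use pandas.
Logic: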
import numpy as np
import pandas as pd
url = "https://raw.githubusercontent.com/bpPrg/Share/master/data/fraud_score.tsv"
df = pd.read_csv(url,delimiter='\t')
print(df.shape)
df['state_quantile'] = df.groupby('state')['fraud_score'].transform(lambda x: x.quantile(0.95))
dfx = df.query("fraud_score >= state_quantile").reset_index(drop=True)\
.sort_values(['state','fraud_score'],ascending=[True,False])
dfx
policy_num state claim_cost fraud_score state_quantile
1 ABCD1027 CA 2663 0.988 0.94710
0 ABCD1016 CA 1639 0.964 0.94710
3 ABCD1079 CA 4224 0.963 0.94710
4 ABCD1081 CA 1080 0.951 0.94710
2 ABCD1069 CA 1426 0.948 0.94710
11 ABCD1222 FL 2392 0.988 0.91920
10 ABCD1218 FL 1419 0.961 0.91920
14 ABCD1291 FL 2581 0.939 0.91920
12 ABCD1230 FL 2560 0.923 0.91920
13 ABCD1277 FL 2057 0.923 0.91920
8 ABCD1189 NY 3577 0.982 0.96615
5 ABCD1117 NY 4903 0.978 0.96615
7 ABCD1187 NY 3722 0.976 0.96615
9 ABCD1196 NY 2994 0.973 0.96615
6 ABCD1121 NY 4009 0.969 0.96615
16 ABCD1361 TX 4950 0.999 0.96000
15 ABCD1304 TX 1407 0.996 0.96000
20 ABCD1398 TX 3191 0.978 0.96000
18 ABCD1366 TX 2453 0.968 0.96000
19 ABCD1386 TX 4311 0.963 0.96000
17 ABCD1363 TX 4103 0.960 0.96000
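To see how many rows a quantile threshold like this keeps, here is a small sketch on made-up data (not the fraud_score file). It relies on Series.quantile using linear interpolation by default; the group sizes of 100 and 101 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data: group "A" has scores 1..100, group "B" has scores 1..101.
toy = pd.DataFrame({
    "state": ["A"] * 100 + ["B"] * 101,
    "score": np.concatenate([np.arange(1, 101), np.arange(1, 102)]).astype(float),
})

# Per-group 0.95 quantile, broadcast back to every row of the group.
threshold = toy.groupby("state")["score"].transform(lambda s: s.quantile(0.95))
top = toy[toy["score"] >= threshold]

print(threshold.groupby(toy["state"]).first())
print(top.groupby("state").size())
```

With 100 rows the threshold is interpolated between two data points, so the filter keeps the 5 highest scores. With 101 rows the 0.95 quantile lands on a data point, so the >= comparison can also keep that row, which is consistent with the six TX rows above whose state_quantile equals the lowest kept score.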