
How to get the top 5 percentile values in pandas series for each class?

I was solving a practice question where I wanted to get the top 5 percentile of frauds for each state. I was able to solve it in SQL, but pandas gives me a different answer than SQL.

Full Question

Top Percentile Fraud
ABC Corp is a mid-sized insurer in the US
and in the recent past their fraudulent claims have increased significantly for their personal auto insurance portfolio.
They have developed a ML based predictive model to identify
propensity of fraudulent claims.

Now, they assign highly experienced claim adjusters for top 5 percentile of claims identified by the model.

Your objective is to identify the top 5 percentile of claims from each state. 
Your output should be policy number, state, claim cost, and fraud score.

Question: How to get the same answer in pandas that I obtained from SQL?

My attempt

  • I broke the fraud score into 100 equal parts using pandas cut and got the categorical codes for each bin, then took values greater than or equal to 95, but this gives a different result.
  • I am trying to get the same answer that I got from my SQL query.
import numpy as np
import pandas as pd

url = "https://raw.githubusercontent.com/bpPrg/Share/master/data/fraud_score.tsv"
df = pd.read_csv(url, delimiter='\t')
print(df.shape)  # (400, 4)

df.head(2)
  policy_num state  claim_cost  fraud_score
0   ABCD1001    CA        4113        0.613
1   ABCD1002    CA        3946        0.156

Problem

  • Group by each state, and find top 5 percentile fraud scores.

My attempt

df['state_ntile'] = (df.groupby('state')['fraud_score']
                     .apply(lambda ser: pd.cut(ser, 100).cat.codes + 1))  # +1 shifts the codes to 1..100

df.query('state_ntile >= 95')\
  .sort_values(['state', 'fraud_score'], ascending=[True, False])\
  .reset_index(drop=True)
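The root of the mismatch is that `pd.cut` splits the *value range* into equal-width intervals, while SQL's `ntile(100)` splits the *rows* into equal-count groups. A minimal sketch on made-up scores (five values, five bins, so the contrast is easy to see):

```python
import pandas as pd

# Skewed scores: four small values and one large one
s = pd.Series([0.01, 0.02, 0.03, 0.04, 0.95])

# pd.cut: five bins of equal *width* over [0.01, 0.95],
# so the four small values all land in the first bin
width_bins = pd.cut(s, 5).cat.codes + 1

# rank: five groups of equal *count*, one row each, like ntile(5)
count_bins = s.rank(method='first').astype(int)

print(width_bins.tolist())  # [1, 1, 1, 1, 5]
print(count_bins.tolist())  # [1, 2, 3, 4, 5]
```

With skewed fraud scores, the top equal-width bins can hold far fewer (or more) than 5% of the rows, which is why the cut-based attempt disagrees with `ntile(100)`.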

PostgreSQL code (I know SQL; I want the answer in pandas)

SELECT policy_num,
       state,
       claim_cost,
       fraud_score,
       a.percentile
FROM
  (SELECT *,
          ntile(100) over(PARTITION BY state
                          ORDER BY fraud_score DESC) AS percentile
   FROM fraud_score)a
WHERE percentile <=5

The output I want


policy_num  state   claim_cost  fraud_score percentile
0   ABCD1027    CA  2663    0.988   1
1   ABCD1016    CA  1639    0.964   2
2   ABCD1079    CA  4224    0.963   3
3   ABCD1081    CA  1080    0.951   4
4   ABCD1069    CA  1426    0.948   5
5   ABCD1222    FL  2392    0.988   1
6   ABCD1218    FL  1419    0.961   2
7   ABCD1291    FL  2581    0.939   3
8   ABCD1230    FL  2560    0.923   4
9   ABCD1277    FL  2057    0.923   5
10  ABCD1189    NY  3577    0.982   1
11  ABCD1117    NY  4903    0.978   2
12  ABCD1187    NY  3722    0.976   3
13  ABCD1196    NY  2994    0.973   4
14  ABCD1121    NY  4009    0.969   5
15  ABCD1361    TX  4950    0.999   1
16  ABCD1304    TX  1407    0.996   1
17  ABCD1398    TX  3191    0.978   2
18  ABCD1366    TX  2453    0.968   3
19  ABCD1386    TX  4311    0.963   4
20  ABCD1363    TX  4103    0.960   5

pure pandas

You can use rank() to get percentiles:

out = df.assign(
    percentile=(100 * df.groupby('state')['fraud_score']
                .rank(ascending=False, pct=True, method='first'))
    .astype(int)  # truncates the fractional part
).query('percentile <= 5')

The outcome is in a different order from the original df, but contains the information you seek:

>>> out
    policy_num state  claim_cost  fraud_score  percentile
15    ABCD1016    CA        1639        0.964           2
26    ABCD1027    CA        2663        0.988           1
68    ABCD1069    CA        1426        0.948           5
78    ABCD1079    CA        4224        0.963           3
80    ABCD1081    CA        1080        0.951           4
116   ABCD1117    NY        4903        0.978           2
120   ABCD1121    NY        4009        0.969           5
186   ABCD1187    NY        3722        0.976           3
188   ABCD1189    NY        3577        0.982           1
195   ABCD1196    NY        2994        0.973           4
217   ABCD1218    FL        1419        0.961           2
221   ABCD1222    FL        2392        0.988           1
229   ABCD1230    FL        2560        0.923           4
276   ABCD1277    FL        2057        0.923           5
290   ABCD1291    FL        2581        0.939           3
303   ABCD1304    TX        1407        0.996           1
360   ABCD1361    TX        4950        0.999           0
362   ABCD1363    TX        4103        0.960           5
365   ABCD1366    TX        2453        0.968           3
385   ABCD1386    TX        4311        0.963           4
397   ABCD1398    TX        3191        0.978           2

duckdb

Having spent over a decade with PostgreSQL (and the late, wonderful Greenplum), I have grown quite fond of duckdb. It is very fast, can operate directly on (from/to) parquet files, etc. Definitely a space to watch.

Here is how it looks on your data:

import duckdb

duck_out = duckdb.query_df(df, 'df', """
SELECT policy_num,
       state,
       claim_cost,
       fraud_score,
       a.percentile
FROM
  (SELECT *,
          ntile(100) over(PARTITION BY state
                          ORDER BY fraud_score DESC) AS percentile
   FROM df) AS a
WHERE percentile <= 5
""").df()
duck_out

And the result:

   policy_num state  claim_cost  fraud_score  percentile
0    ABCD1222    FL        2392        0.988           1
1    ABCD1218    FL        1419        0.961           2
2    ABCD1291    FL        2581        0.939           3
3    ABCD1230    FL        2560        0.923           4
4    ABCD1277    FL        2057        0.923           5
5    ABCD1361    TX        4950        0.999           1
6    ABCD1304    TX        1407        0.996           1
7    ABCD1398    TX        3191        0.978           2
8    ABCD1366    TX        2453        0.968           3
9    ABCD1386    TX        4311        0.963           4
10   ABCD1363    TX        4103        0.960           5
11   ABCD1027    CA        2663        0.988           1
12   ABCD1016    CA        1639        0.964           2
13   ABCD1079    CA        4224        0.963           3
14   ABCD1081    CA        1080        0.951           4
15   ABCD1069    CA        1426        0.948           5
16   ABCD1189    NY        3577        0.982           1
17   ABCD1117    NY        4903        0.978           2
18   ABCD1187    NY        3722        0.976           3
19   ABCD1196    NY        2994        0.973           4
20   ABCD1121    NY        4009        0.969           5

Comparison

An attentive eye will reveal subtle differences between the two results above (beyond the ordering). This is because the rank-based percentile is defined differently from SQL's ntile(100).

Here is how to see these differences:

a = out.set_index('policy_num').sort_index()
b = duck_out.set_index('policy_num').sort_index()

Then:

>>> a.equals(b)
False

>>> a[(a != b).any(axis=1)]
           state  claim_cost  fraud_score  percentile
policy_num                                           
ABCD1361      TX        4950        0.999           0

>>> b[(a != b).any(axis=1)]
           state  claim_cost  fraud_score  percentile
policy_num                                           
ABCD1361      TX        4950        0.999           1

If we look at the value (before truncation) of percentile :

>>> s = (a != b).any(axis=1)
>>> df.assign(
...     percentile=(100 * df.groupby('state')['fraud_score'].rank(
...         ascending=False, pct=True, method='first'))
... ).set_index('policy_num').loc[s[s].index]
           state  claim_cost  fraud_score  percentile
policy_num                                           
ABCD1361      TX        4950        0.999    0.990099
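If ntile-style labels with no percentile 0 are wanted, one option is to round the percentage rank *up* with numpy's ceil instead of truncating, so the top row of each group always lands on a label of at least 1. A minimal sketch on made-up data (two five-row states, so the ntile analog here is 5 rather than 100; on the real data you would `.mul(100)` and `query('percentile <= 5')`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the fraud data: two states, five rows each
# (column names match the question; the scores are invented)
df = pd.DataFrame({
    'state': ['CA'] * 5 + ['TX'] * 5,
    'fraud_score': [0.9, 0.8, 0.7, 0.6, 0.5, 0.95, 0.85, 0.75, 0.65, 0.55],
})

# ceil(n * pct_rank) labels the rows 1..n within each group: the top row
# gets ceil(n / group_size) >= 1, never 0 the way truncation can produce
df['tile'] = (
    df.groupby('state')['fraud_score']
      .rank(ascending=False, pct=True, method='first')
      .mul(5)
      .pipe(np.ceil)
      .astype(int)
)

print(df['tile'].tolist())  # [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
```

For the ABCD1361 row above, ceiling 0.990099 gives 1, matching what ntile(100) assigns.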

Thanks to Emma, I got a partial solution. I could not get ranks like 1, 2, 3, ..., 100, but the resulting table is at least the same as the SQL output. I am still learning how to use pandas.

Logic:

  • To get the top 5 percentile, we can keep the rows whose fraud score is greater than or equal to the state's 0.95 quantile, as shown below:
import numpy as np
import pandas as pd

url = "https://raw.githubusercontent.com/bpPrg/Share/master/data/fraud_score.tsv"
df = pd.read_csv(url, delimiter='\t')
print(df.shape)  # (400, 4)

# Broadcast each state's 0.95 quantile back to every row of that state
df['state_quantile'] = df.groupby('state')['fraud_score'].transform(lambda x: x.quantile(0.95))

dfx = df.query("fraud_score >= state_quantile").reset_index(drop=True)\
        .sort_values(['state', 'fraud_score'], ascending=[True, False])

dfx

Result

    policy_num  state   claim_cost  fraud_score state_quantile
1   ABCD1027    CA  2663    0.988   0.94710
0   ABCD1016    CA  1639    0.964   0.94710
3   ABCD1079    CA  4224    0.963   0.94710
4   ABCD1081    CA  1080    0.951   0.94710
2   ABCD1069    CA  1426    0.948   0.94710
11  ABCD1222    FL  2392    0.988   0.91920
10  ABCD1218    FL  1419    0.961   0.91920
14  ABCD1291    FL  2581    0.939   0.91920
12  ABCD1230    FL  2560    0.923   0.91920
13  ABCD1277    FL  2057    0.923   0.91920
8   ABCD1189    NY  3577    0.982   0.96615
5   ABCD1117    NY  4903    0.978   0.96615
7   ABCD1187    NY  3722    0.976   0.96615
9   ABCD1196    NY  2994    0.973   0.96615
6   ABCD1121    NY  4009    0.969   0.96615
16  ABCD1361    TX  4950    0.999   0.96000
15  ABCD1304    TX  1407    0.996   0.96000
20  ABCD1398    TX  3191    0.978   0.96000
18  ABCD1366    TX  2453    0.968   0.96000
19  ABCD1386    TX  4311    0.963   0.96000
17  ABCD1363    TX  4103    0.960   0.96000
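To also get 1, 2, 3, ... labels per state like the SQL output's `percentile` column, one option is to number the filtered rows with `cumcount` after sorting. A minimal sketch on made-up data (the frame, its column names, and `rank_in_state` are stand-ins for the question's `dfx`; ties and group sizes aside, ntile can label differently):

```python
import pandas as pd

# Toy stand-in for the filtered frame dfx (synthetic scores, two states)
dfx = pd.DataFrame({
    'policy_num': ['A1', 'A2', 'A3', 'B1', 'B2', 'B3'],
    'state': ['CA', 'CA', 'CA', 'TX', 'TX', 'TX'],
    'fraud_score': [0.99, 0.95, 0.97, 0.94, 0.98, 0.96],
})

# Sort highest score first within each state, then number the rows
# 1, 2, 3, ... per state; cumcount counts in the current row order
dfx = dfx.sort_values(['state', 'fraud_score'], ascending=[True, False])
dfx['rank_in_state'] = dfx.groupby('state').cumcount() + 1

print(dfx['rank_in_state'].tolist())  # [1, 2, 3, 1, 2, 3]
```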
