简体   繁体   中英

Pandas quantile function not returning the correct number of given quantiles

I have a dataframe with over 2,000 records that has multiple columns with various balances. Based on the balance amount I want to assign it to a bucket.

Trying to split each balance column into a quantile and have the following buckets 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9 Concretely, translating the balances into the following buckets: top 10%, top 20%, top 30%, etc...

If I'm understanding correctly, so long as there are more than 10 records it should bucket each record in a percentile based on linear interpolation.

So I run the following:

score_quantiles =  df.quantile(q=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])
score_quantiles.to_dict()



# Arguments (x = value, p = field (i.e bal 1, bal2, bal3) , d = score_quantiles)

def dlpScore(x,p,d):
    if pd.isnull(x) == True:
        return 0 
    elif int(x) == 0:
        return 0
    elif x <= d[p][0.1]:
        return 1
    elif x <= d[p][0.2]:
        return 2
    elif x <= d[p][0.3]: 
        return 3
    elif x <= d[p][0.4]: 
        return 4
    elif x <= d[p][0.5]: 
        return 5
    elif x <= d[p][0.6]: 
        return 6
    elif x <= d[p][0.7]: 
        return 7
    elif x <= d[p][0.8]: 
        return 8
    elif x <= d[p][0.9]: 
        return 9
    else:
        return 10



df['SCORE_BAL1'] = df['bal1'].apply(dlpScore, args=('bal1',score_quantiles,))




Problem is, on some columns it gives me all the buckets, on others it only gives me a few:

在此处输入图像描述 在此处输入图像描述

Is there a way to ensure it creates all the buckets? I'm probably missing something here.

If you want to make sure that you are getting similar distributions among 'buckets' you might want to try the pandas qcut function. The full documentation is here .

To use it in your code and get deciles for instance you could just do

n_buckets=10
df['quantile'] = pd.qcut(df['target_column'], q=n_buckets)

And if you want to apply a specific label you can just do something like this

n_buckets=10
df['quantile'] = pd.qcut(df['target_column'], q=n_buckets, labels=range(1,n_buckets+1))

PS: Just beware that for this latter case if qcut is not able to generate the desired number of quantiles (eg, because there are not enough values to create a quantile), you will get an exception for passing more labels than there are quantiles.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM