
How to efficiently assign a new value to a bin after creating the bins with pandas.cut()?

Say I have a column in a dataframe which is 'user_age', and I have created 'user_age_bin' by something like:

df['user_age_bin']= pd.cut(df['user_age'], bins=[10, 15, 20, 25,30])

Then I build a machine learning model by using the 'user_age_bin' feature.

Next, I get a single new record that I need to feed into my model to make a prediction. I don't want to use user_age as it is, because the model uses user_age_bin. So how can I convert a user_age value (say 28) into its user_age_bin? I know I can write a function like this:

def assign_bin(age):
    if age < 10:
        return '<10'
    elif age< 15:
        return '10-15'
    # ... etc. etc.

and then do:

user_age_bin = assign_bin(28)

But this solution is not elegant at all. I guess there must be a better way, right?

Edit: I changed the code and added explicit bin range. Edit2: Edited wording and hopefully the question is clearer now.

tl;dr: np.digitize is a good solution.

After reading all the comments and answers here and doing some more Googling, I think I have a solution that I am pretty satisfied with. Thank you to all of you!

Setup

import pandas as pd
import numpy as np
np.random.seed(42)

bins = [0, 10, 15, 20, 25, 30, np.inf]
labels = bins[1:]
ages = list(range(5, 90, 5))
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=False)

# sort by age 
print(df.sort_values('user_age'))

Output :

    user_age  user_age_bin
0          5             0
1         10             0
2         15             1
3         20             2
4         25             3
5         30             4
6         35             5
7         40             5
8         45             5
9         50             5
10        55             5
11        60             5
12        65             5
13        70             5
14        75             5
15        80             5
16        85             5

Assign category :

# a new age value
new_age=30

# use this right=True and '-1' trick to make the bins match
print(np.digitize(new_age, bins=bins, right=True) -1)

Output :

4
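The same trick also works on a whole array at once. Below is a minimal sketch that wraps it in a helper (the name age_to_bin is made up for illustration) and checks the result against pd.cut with labels=False, reusing the bins from the setup. One caveat: values at or below the first edge come out as -1 here, whereas pd.cut would return NaN for them.

```python
import numpy as np
import pandas as pd

bins = [0, 10, 15, 20, 25, 30, np.inf]

def age_to_bin(age, edges=bins):
    """Return the 0-based bin index that pd.cut(..., labels=False) would give."""
    # right=True makes the intervals closed on the right, like pd.cut's default
    return np.digitize(age, bins=edges, right=True) - 1

new_ages = np.array([5, 10, 28, 30, 85])
digitized = age_to_bin(new_ages)
expected = pd.cut(new_ages, bins=bins, labels=False)

print(digitized)                            # [0 0 4 4 5]
print(np.array_equal(digitized, expected))  # True
```

Because np.digitize accepts scalars as well as arrays, age_to_bin(28) returns 4 directly.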

A somewhat ugly approach that uses a double list comprehension further down, but it seems to do the job.

Setup:

import pandas as pd
import numpy as np
np.random.seed(42)

bins = [10, 15, 20, 25, 30, np.inf]
labels = bins[1:]
ages = np.random.randint(10, 35, 10)
df = pd.DataFrame({"user_age": ages})
df["user_age_bin"] = pd.cut(df["user_age"], bins=bins, labels=labels)
print(df)

Out:

   user_age user_age_bin
0        16         20.0
1        29         30.0
2        24         25.0
3        20         20.0
4        17         20.0
5        30         30.0
6        16         20.0
7        28         30.0
8        32          inf
9        20         20.0

Assignment:

# `new_ages` is what you want to assign labels to, used `ages` for simplicity
new_ages = ages
ids = [np.argmax([age <= x for x in labels]) for age in new_ages]
assigned_labels = [labels[i] for i in ids]
print(pd.DataFrame({"new_ages": new_ages, "assigned_labels": assigned_labels, "user_age_bin": df["user_age_bin"]}))

Out:

   new_ages  assigned_labels user_age_bin
0        16             20.0         20.0
1        29             30.0         30.0
2        24             25.0         25.0
3        20             20.0         20.0
4        17             20.0         20.0
5        30             30.0         30.0
6        16             20.0         20.0
7        28             30.0         30.0
8        32              inf          inf
9        20             20.0         20.0
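If the double list comprehension feels heavy, np.searchsorted should give the same indices in a single vectorized call. This is a sketch, not part of the original answer; the ages are hard-coded to the values printed above so the snippet stands on its own. Note that it only looks at the upper edges, so an age at or below the lowest edge (10) would land in the first bin, where pd.cut would give NaN.

```python
import numpy as np

labels = [15, 20, 25, 30, np.inf]  # upper bin edges, as in the setup above
new_ages = np.array([16, 29, 24, 20, 17, 30, 16, 28, 32, 20])

# Index of the first upper edge that is >= age; side="left" keeps
# an age that equals an edge inside that edge's bin (right-closed bins).
ids = np.searchsorted(labels, new_ages, side="left")
assigned_labels = np.asarray(labels)[ids]

print(assigned_labels.tolist())
```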

You can try something like:

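bins = [10, 15, 20, 25, 30]
labels = [f'<{bins[0]}', *(f'{a}-{b}' for a, b in zip(bins[:-1], bins[1:])), f'{bins[-1]}>']
# pd.cut needs exactly one more edge than labels, so pad the edges with -inf and +inf
pd.cut(df['user_age'], bins=[-np.inf, *bins, np.inf], labels=labels)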

Note that if you are using Python < 3.6, you should replace the f-strings with str.format-style syntax.
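To label a single new value with the same string labels, one option is to pass it to pd.cut as a one-element list. The sketch below assumes open-ended outer bins (edges padded with -inf and +inf so that six labels fit seven edges); the helper name label_age is made up for illustration.

```python
import numpy as np
import pandas as pd

bins = [10, 15, 20, 25, 30]
labels = [f"<{bins[0]}", *(f"{a}-{b}" for a, b in zip(bins[:-1], bins[1:])), f"{bins[-1]}>"]
edges = [-np.inf, *bins, np.inf]  # assumption: open-ended first and last bins

def label_age(age):
    # pd.cut expects array-like input, so wrap the scalar and take the first result
    return pd.cut([age], bins=edges, labels=labels)[0]

print(label_age(28))  # 25-30
print(label_age(5))   # <10
print(label_age(42))  # 30>
```

With pd.cut's default right-closed intervals, an age of exactly 10 falls into the '<10' bin; passing right=False would move it into '10-15' if that reads better.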

You can't put strings into a model, so you'll need to either create a mapping and keep track of it, or create a separate column to use later:

def apply_age_bin_numeric(value):
    if value <= 10:
        return 1
    elif 10 < value <= 20:
        return 2
    elif 20 < value <= 30:  # lower bound is 20 so that 21 is included
        return 3
    # ... etc.

def apply_age_bin_string(value):
    if value <= 10:
        return '<=10'
    elif 10 < value <= 20:
        return '11-20'
    elif 20 < value <= 30:  # lower bound is 20 so that 21 is included
        return '21-30'
    # ... etc.

df['user_age_bin_numeric']= df['user_age'].apply(apply_age_bin_numeric)
df['user_age_bin_string']= df['user_age'].apply(apply_age_bin_string)  

For the model, you'll keep user_age_bin_numeric and drop user_age_bin_string.

Save a copy of the data with both fields included before it goes into the model. This way you can match the predictions back to the string version of the bin fields if you want to show those instead of the numerical bins.
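As an alternative to two hand-written functions, both columns and the lookup table can be derived from a single pd.cut call. This is only a sketch: the edges, the '>30' label, and the sample ages are illustrative assumptions, not taken from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"user_age": [8, 15, 22, 30]})  # made-up sample data
edges = [-np.inf, 10, 20, 30, np.inf]             # illustrative edges
names = ["<=10", "11-20", "21-30", ">30"]         # illustrative labels

df["user_age_bin_string"] = pd.cut(df["user_age"], bins=edges, labels=names)
# .cat.codes gives 0-based integer codes in category order; +1 matches the 1-based bins
df["user_age_bin_numeric"] = df["user_age_bin_string"].cat.codes + 1

# Lookup table to translate numeric bins back to strings after prediction
code_to_label = dict(enumerate(df["user_age_bin_string"].cat.categories, start=1))

print(df)
print(code_to_label)
```

Keeping code_to_label next to the model means predictions can be mapped back to readable bin names without re-deriving the bins.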
