
Pandas cut with non-unique labels

I'm trying to bin data and assign a float value to each row based on the bin it falls in. I thought pandas.cut was the tool for this, but apparently it requires the bin labels to be unique.

values = [0.6, 0.5, 0.5, 0.6, 0.8, 0.9]
bins = [0, 2, 5, 10, 15, 25, 200]
binned = pd.cut(original_table[field], bins, labels=values)

>>> ValueError: Categorical categories must be unique

My data (original_table) is very large and doing anything iteratively is quite slow, which is why cut was an appealing tool. Is there a workaround to make pd.cut work for this?
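
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The Series `data` is a made-up stand-in for original_table[field], which is not shown in the question. The exact wording of the error message seems to depend on the pandas version, so the sketch only checks that a ValueError is raised.

```python
import pandas as pd

# Hypothetical stand-in for original_table[field]
data = pd.Series([1, 3, 7, 12, 20, 100])

values = [0.6, 0.5, 0.5, 0.6, 0.8, 0.9]   # note the repeated 0.6 and 0.5
bins = [0, 2, 5, 10, 15, 25, 200]

try:
    pd.cut(data, bins, labels=values)
    raised = False
except ValueError as exc:
    # Duplicate labels are rejected; the message text varies by version
    raised = True
    print(exc)
```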

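Found a workaround: ask pd.cut for the integer bin index of each row (labels=False) and use those indices to look up the values in a NumPy array: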

values = [0.6, 0.5, 0.5, 0.6, 0.8, 0.9]
bins = [0, 2, 5, 10, 15, 25, 200]
binned = np.array(values)[pd.cut(original_table[field], bins, labels=False)]
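
One caveat with the lookup above: pd.cut returns NaN for anything outside the bins, and NaN cannot be used as an array index. Below is a rough sketch of one way to guard against that. The sample Series `data` (including the out-of-range 500) is invented for illustration; it is not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 500 falls outside the last bin edge (200)
data = pd.Series([1, 3, 7, 12, 20, 100, 500])

values = np.array([0.6, 0.5, 0.5, 0.6, 0.8, 0.9])
bins = [0, 2, 5, 10, 15, 25, 200]

# Integer bin index per row; NaN where the value is outside all bins
codes = pd.cut(data, bins, labels=False)

# Start from all-NaN and fill only the rows that landed in a bin
binned = pd.Series(np.nan, index=data.index, dtype=float)
in_range = codes.notna()
binned[in_range] = values[codes[in_range].astype(int).to_numpy()]

print(binned)
```

This stays vectorized, so it should scale to a large table about as well as the one-liner.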

Another option I found to circumvent this issue is to pass the labels as a pd.Categorical. It also looks like this will be fixed in pandas soon.

import pandas as pd
import numpy as np


values = [0.6, 0.5, 0.5, 0.6, 0.8, 0.9]
bins = [0, 2, 5, 10, 15, 25, 200]

# Pass the labels as a Categorical instead of a plain list
binned = pd.cut(original_table[field], bins, labels=pd.Categorical(values))
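
Regarding the fix mentioned above: as far as I can tell, pandas 1.1.0 and later accept an `ordered` argument on pd.cut, and duplicate labels are allowed when ordered=False. A minimal sketch, assuming a recent pandas and using an invented Series `data` in place of original_table[field]:

```python
import pandas as pd

# Hypothetical stand-in for original_table[field]
data = pd.Series([1, 3, 7, 12, 20, 100])

values = [0.6, 0.5, 0.5, 0.6, 0.8, 0.9]
bins = [0, 2, 5, 10, 15, 25, 200]

# ordered=False (assumed available, pandas >= 1.1.0) permits repeated labels
binned = pd.cut(data, bins, labels=values, ordered=False)

# The result is categorical; convert to plain floats if needed
result = binned.astype(float)
print(result)
```

The result is an unordered categorical, so bins that share a label are merged into one category.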

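Demo of yet another workaround: make the labels unique strings by appending a marker character ('X') to the duplicates, then strip the marker and convert to float: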

In [127]: df = pd.DataFrame({'val':np.random.randint(0, 200, 10)})

In [128]: values = ['0.6', '0.5', '0.5X', '0.6X', '0.8', '0.9']
     ...: bins = [0, 2, 5, 10, 15, 25, 200]
     ...:

In [129]: df['new'] = pd.cut(df['val'], bins, labels=values).str.replace('X','').astype('float')

In [130]: df
Out[130]:
   val  new
0   25  0.8
1  115  0.9
2   63  0.9
3   29  0.9
4   74  0.9
5  133  0.9
6  194  0.9
7  152  0.9
8   94  0.9
9   84  0.9

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 