简体   繁体   中英

Cleanest way to bin using Pandas.cut

The purpose of this post is discussion primarily, so even loose ideas or strings to pull would be appreciated. I'm trying to bin some data for analysis, and was wondering what is the cleanest way to bin my data using Pandas.cut . For some context, I'm specifically trying to bin ICD-9 diagnostic data into categories and am using this list as a starting point. From what I'm reading, a common way to do this is something like this:

break_points = [0, 139, 239, ...]
labels = ['infectious and parasitic diseases', 'neoplasms', 'endocrine diseases', ...]
df['diag_codes_binned'] = pd.cut(df['diag_codes'],
                                 bins=break_points,
                                 labels=labels)

I recognize that this is a perfectly functional way to do this, but I don't like how hard it is to visually inspect the code and determine what range lines up with what label. I am exploring using a dictionary for this like this:

diagnosis_code_dict = {139: 'infectious and parasitic diseases',
                       239: 'neoplasms',
                       279: 'endocrine diseases',
                       ...}

But the pd.cut function doesn't seem to get along well with my dictionary. There appears to be one way to do this using a dataframe as a lookup table with min and max values, shown here , and that seems to be one possibility (example below):

In [187]: lkp
Out[187]:
   Min  Max  Val
0    1   99  AAA
1  100  199  BBB
2  200  299  CCC
3  300  399  DDD

Lastly, I have one more consideration for the data set that I'm working through the best way to handle. Some of the diagnostic codes start with V or E, and currently I'm planning on pre-processing these to convert them into an extension of the range and handling them that way. For example, if the range of possible non-E/V codes is range(0,1000) , then I could convert E's into a range(1000, 2000) and V's into a range(2000, 3000) so that I could maintain a single lookup table or dictionary for all codes from which I could cut into however many bins I wanted. That said, this method results in some loss of the ability to at-a-glance understand these codes, so I'd be open to suggestions if there is a better way to handle this.

I would simply write a small helper function. Here's one idea:

import pandas as pd

def bin_helper(code_dict):
    break_points = [0] + sorted(code_dict) #0 added for lower bound on binning
    labels = [code_dict[value] for value in sorted(code_dict)]
    return break_points, labels

# Setting up some minimal reproducible code...
data = {'diag_codes': range(1, 300),
        'diag_codes_binned': ''}
df = pd.DataFrame.from_dict(data)
diag_code_dict = {139: 'infectious and parasitic diseases',
                  239: 'neoplasms',
                  279: 'endocrine diseases'}

# Run the function and drop it into pandas.cut
bins, labels = bin_helper(diag_code_dict)
df['diag_codes_binned'] = pd.cut(df['diag_codes'],
                                 bins=bins,
                                 labels=labels)

I agree that dictionaries (besides being an incredibly fast, versatile data structure in their own right!) are a very nice way to provide some context in your code about what data are supposed to mean. I often use a small "black-box" function to do the actual work if I need a dictionary to serve as part of my documentation.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM