简体   繁体   中英

Pandas: map values of categorical variable to a predefined list of dummy columns

I have a categorical variable with known levels (eg hour that just contains values between 0 and 23), but not all of them are available right now (say, we have measurements from between 0 and 11 o'clock, while hours from 12 to 23 are not covered), though other values are going to be added later. If we naively use pandas.get_dummies() to map values to indicator variables, we will end up with only 12 of them instead of 24. Is there a way to map values of the categorical variable to a predefined list of dummy variables ?

Here's an example of expected behaviour:

possible_values = range(24)
hours = get_dummies_on_steroids(df['hour'], prefix='hour', levels=possible_values)

Using the new and improved Categorical type in pandas 0.15:

import pandas as pd
import numpy as np
df = pd.DataFrame({'hour': [0, 1, 3, 8, 13, 14], 'val': np.random.randn(6)})
df
Out[4]: 
   hour       val
0     0 -0.098287
1     1 -0.682777
2     3  1.000749
3     8 -0.558877
4    13  1.423675
5    14  1.461552

df['hour_cat'] = pd.Categorical(df['hour'], categories=range(24))
pd.get_dummies(df['hour_cat'])
Out[6]: 
   0   1   2   3   4   5   6   7   8   9  ...  
0   1   0   0   0   0   0   0   0   0   0 ...      
1   0   1   0   0   0   0   0   0   0   0 ...   
2   0   0   0   1   0   0   0   0   0   0 ...   
3   0   0   0   0   0   0   0   0   1   0 ...   
4   0   0   0   0   0   0   0   0   0   0 ...   
5   0   0   0   0   0   0   0   0   0   0 ...

The situation you describe, where you know your data can take a specific set of values, but you haven't necessarily observed all of them, is exactly what Categorical is good for.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM