Using get_dummies() and OneHotEncoder on a large number of categorical variables

In most academic examples, categorical features are converted using get_dummies() or OneHotEncoder. Let's say I want to use Country as a feature, and the dataset has 100 unique countries. When we apply get_dummies() or OneHotEncoder to country, we get 100 columns, and the model is trained on 100 country columns plus the other features.

Let's say we have deployed this model to production and we receive data containing only 10 countries. When we pre-process that data with get_dummies() or OneHotEncoder, the model fails to predict because the number of features it was trained on does not match the number of features passed: we are now passing 10 country columns plus the other features.
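The mismatch is easy to reproduce on a small scale. This is only a minimal sketch, assuming pandas is installed; the two tiny frames `train` and `prod` are made-up stand-ins for the training set and a production batch.

```python
import pandas as pd

# Hypothetical training data with four distinct countries.
train = pd.DataFrame({"country": ["USA", "Russia", "China", "Spain"]})
# Hypothetical production batch that happens to contain only two of them.
prod = pd.DataFrame({"country": ["Russia", "China"]})

# get_dummies() derives the columns from whatever values are present
# in the frame it is given, so the two results have different widths.
train_encoded = pd.get_dummies(train, columns=["country"])
prod_encoded = pd.get_dummies(prod, columns=["country"])

# Columns the model was trained on but that are absent in production.
missing = sorted(set(train_encoded.columns) - set(prod_encoded.columns))

print(train_encoded.shape[1])  # 4
print(prod_encoded.shape[1])   # 2
print(missing)                 # ['country_Spain', 'country_USA']
```

A model fitted on `train_encoded` expects four country columns, so passing `prod_encoded` with two columns is what triggers the feature-count error.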

Can you please help me understand how to handle such scenarios? How should a large number of categorical variables, spread across multiple columns, be pre-processed during model building?

Indeed, pandas.get_dummies() should not be used in deployment, for the reason you described. scikit-learn's OneHotEncoder, though, handles this situation just fine:

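from sklearn import preprocessing
import pandas as pd

# handle_unknown='ignore' encodes categories not seen during fit as all zeros
ohe = preprocessing.OneHotEncoder(handle_unknown='ignore')
X_train = pd.DataFrame({'country': ['USA', 'Russia', 'China', 'Spain']})
X_test = pd.DataFrame({'country': ['Russia', 'Ukraine', 'China', 'Russia']})
ohe.fit(X_train)  # learns the categories, sorted: China, Russia, Spain, USA
ohe.transform(X_test).toarray()  # always 4 columns, one per learned category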

array([[0., 1., 0., 0.],
       [0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])

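(Here I have set handle_unknown='ignore' so that new labels ('Ukraine') get encoded as all zeros. If you set handle_unknown='error', which is the default, new labels will raise an error.) So OneHotEncoder can handle a different set of labels in the test set: it learns the categories once, when it is fit on the training data, and transform() always produces one column per learned category, no matter which labels appear in the data being transformed. The encoder must therefore be fit only on the training data and then reused, not re-fit, in production.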
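For the second part of the question, several categorical columns alongside other features, one common pattern is to wrap the encoder in a ColumnTransformer inside a Pipeline, so the fitted encoder travels with the model. The following is only a sketch under assumptions: the `device` and `age` columns, the toy labels, and the choice of LogisticRegression are invented for illustration and are not part of the original question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: two categorical columns and one numeric column.
X_train = pd.DataFrame({
    "country": ["USA", "Russia", "China", "Spain"],
    "device": ["mobile", "desktop", "mobile", "tablet"],
    "age": [34, 51, 27, 45],
})
y_train = [1, 0, 1, 0]  # made-up binary target

# One-hot encode the categorical columns; pass the numeric column through.
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["country", "device"]),
    ],
    remainder="passthrough",
)

# The pipeline fits the encoder and the classifier together on training data.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Hypothetical production batch with unseen labels ("Ukraine", "tv").
X_new = pd.DataFrame({
    "country": ["Ukraine", "China"],
    "device": ["mobile", "tv"],
    "age": [30, 60],
})

# The feature count matches training: 4 countries + 3 devices + 1 numeric.
encoded_new = model.named_steps["preprocess"].transform(X_new)
predictions = model.predict(X_new)
print(encoded_new.shape)
```

Because the encoder is fitted once inside the pipeline, the whole `model` object can be saved after training (for example with joblib) and loaded in production, which avoids re-fitting the encoder on incoming data.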
