How to store the get_dummies transformation of pandas in Python?

Question

The pandas package in Python provides the get_dummies transformation, which converts categorical variables into binary (flag) variables with values 0 / 1. The resulting columns depend on the values actually present in the data, but I'd like to store the transformation itself, so that I can run it on other datasets that contain fewer distinct values and still get the full-sized transformed data structure.

Say you have this code:

import pandas as pd
a = [[5,12,"blue"], [8,53,"yellow"]]
df = pd.DataFrame(a, columns=['Weight','Size','Color'])
df

Producing this data:

Weight  Size    Color
5       12      blue
8       53      yellow

and:

df = pd.get_dummies(df)
df

produces this:

Weight  Size    Color_blue  Color_yellow
5       12      1           0
8       53      0           1

I'd like to store this original transformation, so that if I get a record later, like:

[2,9,"blue"]

I can still get the whole structure, like:

Weight  Size    Color_blue  Color_yellow
2       9       1           0

In the latter case, get_dummies will omit the Color_yellow column, because "yellow" does not occur in the new record.
What is the simplest solution to this?

I was thinking of building my own get_dummies function, which goes through all the categorical variables, collects all their distinct values, and then generates the code of a Python function that performs the transformation. But there must be an already implemented solution for this...
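The idea described above can also be sketched without generating code: record the distinct values of each categorical column once, then reapply them as a fixed categorical dtype before calling get_dummies. This is only a sketch under stated assumptions. The helper names `fit_categories` and `transform_with_categories` are hypothetical (they are not part of pandas), and the list of categorical columns is assumed to be known and passed in explicitly.

```python
import pandas as pd


def fit_categories(df, columns):
    """Record the distinct values of the given columns (hypothetical helper)."""
    return {col: sorted(df[col].dropna().unique()) for col in columns}


def transform_with_categories(df, categories):
    """One-hot encode df so that every recorded value gets its own column."""
    df = df.copy()
    for col, values in categories.items():
        # A fixed categorical dtype makes get_dummies emit one column per
        # recorded value, even if that value is absent from this dataset.
        df[col] = df[col].astype(pd.CategoricalDtype(categories=values))
    return pd.get_dummies(df, columns=list(categories), dtype=int)


train = pd.DataFrame([[5, 12, "blue"], [8, 53, "yellow"]],
                     columns=["Weight", "Size", "Color"])
categories = fit_categories(train, ["Color"])  # this dict is what gets stored

new = pd.DataFrame([[2, 9, "blue"]], columns=["Weight", "Size", "Color"])
encoded = transform_with_categories(new, categories)
print(encoded)
```

With this approach, a value that was not recorded becomes missing after the dtype conversion, so its row gets 0 in every flag column of that variable.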

Answer 1

This is what I was looking for. The code applies the transformation and also prints the statements that have to be run on later datasets:

import numpy as np
import pandas as pd

a = [[5,12,"blue","apple"], [8,53,"yellow","pear"], [1,8,"brown","peach"],[1,2,"blue","plum"]]
df = pd.DataFrame(a, columns=['Weight','Size','Color','Fruit'])

# Loop over the categorical (object-dtype) columns
for col in df.select_dtypes(include=["object"]).columns:
    for i in df[col].unique():
        # Create the flag column and print the statement that recreates it
        df[col+"_"+i] = np.where(df[col] == i, 1, 0)
        print('df["'+col+'_'+i+'"] = np.where(df["'+col+'"] == "'+i+'", 1, 0)')
    # Drop the original column and print the matching statement
    df = df.drop(columns=[col])
    print('df = df.drop(columns=["'+col+'"])')
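A shorter alternative, offered as a sketch rather than as the method used above, is to keep the column list that get_dummies produces on the original data and reindex later datasets to it. The variable names `train`, `later` and `dummy_columns` are illustrative only, and the sketch assumes the stored list is saved somewhere the later code can load it from.

```python
import pandas as pd

columns = ["Weight", "Size", "Color", "Fruit"]
train = pd.DataFrame([[5, 12, "blue", "apple"], [8, 53, "yellow", "pear"]],
                     columns=columns)

# Store the full column layout once; this list is what gets saved for later.
dummy_columns = pd.get_dummies(train, dtype=int).columns.tolist()

later = pd.DataFrame([[2, 9, "blue", "pear"]], columns=columns)

# Flag columns missing from the later data are added and filled with 0,
# and the column order is forced to match the original layout.
later_encoded = pd.get_dummies(later, dtype=int).reindex(
    columns=dummy_columns, fill_value=0
)
print(later_encoded)
```

Note that reindex also drops flag columns for values that never appeared in the original data, so unseen categories disappear silently instead of raising an error.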

solution1
1 ACCEPTED 2018-02-26 09:23:26