
How to store the get_dummies transformation of pandas in Python?


The pandas package in Python has a get_dummies transformation, which converts categorical variables into binary (flag) variables with values 0 / 1. The transformation is derived from the values actually present in the data, but I'd like to store the transformation itself, so that I can run it on other datasets that contain fewer distinct values and still get the full-sized transformed data structure.

Say you have this code:

import pandas as pd
a = [[5,12,"blue"], [8,53,"yellow"]]
df = pd.DataFrame(a, columns=['Weight','Size','Color'])
# Weight and Size are already inferred as integer columns, so no pd.to_numeric conversion is needed
df

Producing this data:

Weight  Size    Color
5       12      blue
8       53      yellow

and:

df = pd.get_dummies(df)
df

produces this:

Weight  Size    Color_blue  Color_yellow
5       12      1           0
8       53      0           1

I'd like to store this original transformation, so that if I get a record later, like:

[2,9,"blue"]

I can still get the whole structure, like:

Weight  Size    Color_blue  Color_yellow
2       9       1           0

get_dummies will omit the Color_yellow column in the latter case, because "yellow" does not occur in the new data.
What is the simplest solution to this?
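
To make the omission concrete, here is a minimal reproduction. The names train and later are only illustrative and are not part of the original code.

```python
import pandas as pd

# "train" is the original dataset, "later" is the single record that arrives afterwards
train = pd.DataFrame([[5, 12, "blue"], [8, 53, "yellow"]],
                     columns=["Weight", "Size", "Color"])
later = pd.DataFrame([[2, 9, "blue"]], columns=["Weight", "Size", "Color"])

# get_dummies only creates columns for the values it actually sees
train_cols = list(pd.get_dummies(train).columns)
later_cols = list(pd.get_dummies(later).columns)

print(train_cols)  # ['Weight', 'Size', 'Color_blue', 'Color_yellow']
print(later_cols)  # ['Weight', 'Size', 'Color_blue']
```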

I was thinking of building my own get_dummies function, which goes through all the categorical variables, collects all their distinct values, and then generates the code of a Python function that performs the transformation. But there must already be an implemented solution for this...
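
One possible shape for such a function, as a sketch only: record the distinct values of each categorical column once, then declare them as the categories of a pd.Categorical before calling get_dummies. The helper names fit_categories and transform are hypothetical, and the list of categorical columns is assumed to be passed in explicitly.

```python
import pandas as pd

def fit_categories(df, columns):
    """Record the distinct values of each listed categorical column."""
    return {col: sorted(df[col].dropna().unique()) for col in columns}

def transform(df, categories):
    """One-hot encode df using the stored categories, not the values present."""
    out = df.copy()
    for col, values in categories.items():
        # A Categorical remembers every declared category, even unobserved ones
        out[col] = pd.Categorical(out[col], categories=values)
    return pd.get_dummies(out, columns=list(categories), dtype=int)

train = pd.DataFrame([[5, 12, "blue"], [8, 53, "yellow"]],
                     columns=["Weight", "Size", "Color"])
later = pd.DataFrame([[2, 9, "blue"]], columns=["Weight", "Size", "Color"])

categories = fit_categories(train, ["Color"])
result = transform(later, categories)
print(result)
```

With this approach, a value that was not seen when the categories were recorded becomes missing in the Categorical, so its row gets 0 in every flag column.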

This is what I was looking for. The code prints the transformation statements that have to be applied to later datasets:

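import pandas as pd
import numpy as np
a = [[5,12,"blue","apple"], [8,53,"yellow","pear"], [1,8,"brown","peach"],[1,2,"blue","plum"]]
df = pd.DataFrame(a, columns=['Weight','Size','Color','Fruit'])
# Weight and Size are already numeric, so no pd.to_numeric conversion is needed

# For every string (object) column, create one 0/1 flag column per distinct value
# and print the statement that reproduces that column on a later dataset
for col in df.select_dtypes(include=["object"]).columns:
    for i in df[col].unique():
        df[col+"_"+i] = np.where(df[col] == i, 1, 0)
        print('df["'+col+'_'+i+'"] = np.where(df["'+col+'"] == "'+i+'", 1, 0)')
    # Drop the original column once all of its flag columns exist
    df = df.drop(columns=[col])
    print('df = df.drop(columns=["'+col+'"])')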
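
An alternative that avoids generating code, offered as a sketch rather than as the method above: keep the column list produced by get_dummies on the original data and reindex later results against it. The variable names here are illustrative, and dtype=int is passed so the flags are 0 / 1.

```python
import pandas as pd

train = pd.DataFrame([[5, 12, "blue"], [8, 53, "yellow"]],
                     columns=["Weight", "Size", "Color"])
train_dummies = pd.get_dummies(train, dtype=int)

# This list is the "stored transformation"; it could be saved to disk and reloaded
dummy_columns = list(train_dummies.columns)

later = pd.DataFrame([[2, 9, "blue"]], columns=["Weight", "Size", "Color"])
# Columns missing from the later data are added and filled with 0
later_dummies = pd.get_dummies(later, dtype=int).reindex(columns=dummy_columns, fill_value=0)
print(later_dummies)
```

Note that reindex also drops any column that is not in the stored list, so a colour that was never seen in the original data would be discarded silently.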

