
Pandas - After creating dummy columns, how do I get the dummy representation of a new vector?

Some of my columns contain textual categorical values, such as "did_do_something" with the possible values "true" or "false", or "browser_type" with possible values like "chrome" and "safari". Other columns contain numeric category enumerations, such as "version_type", which can take values like "1", "2", "3", "4". Finally, there are plain numeric columns like "age", which hold a numeric value and should stay unaltered.

I checked the pandas documentation here https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

Specifically this flag:

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
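To illustrate what that parameter changes, here is a minimal sketch on a made-up three-column frame (the column names and values are hypothetical, not taken from the real data). With the default, only the object-dtype column is encoded; listing the integer-coded column in columns encodes it as well.

```python
import pandas as pd

# Hypothetical toy frame: one text category, one integer-coded category, one plain numeric
df = pd.DataFrame({
    "browser_type": ["chrome", "safari"],
    "version_type": [1, 2],
    "age": [30, 41],
})

# Default: only object/category columns are encoded, so version_type stays numeric
default_encoded = pd.get_dummies(df)

# Explicit list: version_type is encoded too, while age is left unaltered
explicit_encoded = pd.get_dummies(df, columns=["browser_type", "version_type"])

print(list(default_encoded.columns))
print(list(explicit_encoded.columns))
```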

My dummy processing looks like this:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data_csv_file = 'data/data.csv'

data = pd.read_csv(data_csv_file).dropna()

# this is the column containing the label for the row
label_column = 'converted_pixel'

# these columns SHOULD NOT be encoded since they are real numeric values
numeric_columns = ['campaign_frequency','user_age_days']

# all the other columns which are not label or numeric should be dummy encoded
dummy_columns = [a for a in data.columns if a != label_column and a not in numeric_columns]

# create the new processed data frame with the dummy columns
processed_dummy_data = pd.get_dummies(data,columns = dummy_columns)

The processed data frame ends up with around 1000 columns, up from the original 21.

My question is: given a vector (a row) from the original dataframe, how can I get its dummy encoding so that it matches the columns of the resulting dummy data frame?

Since the dummy data frame is so large, it is impractical to do this by hand.

I am looking for an API like

dummy_encoded_vector = get_dummy_encoding(vector_from_original_dataframe_encoding, processed_dummy_data)

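You can use str.startswith to select the dummy columns that were generated from a given original column. (str.contains also works, but it would match the prefix in the middle of another column name as well.)

df=pd.DataFrame({'A':list('abcde'),'B':list('abcde')})

s=pd.get_dummies(df)

yourcol='A'

s.loc[:,s.columns.str.startswith(yourcol+'_')]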

Out[117]: 
   A_a  A_b  A_c  A_d  A_e
0    1    0    0    0    0
1    0    1    0    0    0
2    0    0    1    0    0
3    0    0    0    1    0
4    0    0    0    0    1
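For encoding a new row against the existing dummy layout, one common approach is to run get_dummies on the new row and then reindex it to the columns of the processed frame. The following is only a sketch under assumptions: get_dummy_encoding is a hypothetical helper (pandas has no function of that name), and the small train frame stands in for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the original data
train = pd.DataFrame({
    "browser_type": ["chrome", "safari", "chrome"],
    "version_type": [1, 2, 3],
    "user_age_days": [10, 20, 30],
})
dummy_columns = ["browser_type", "version_type"]
processed_dummy_data = pd.get_dummies(train, columns=dummy_columns)

def get_dummy_encoding(rows, dummy_columns, reference_columns):
    """Hypothetical helper: encode `rows` and align them to `reference_columns`."""
    encoded = pd.get_dummies(rows, columns=dummy_columns)
    # Dummy columns absent from `rows` are added and filled with 0;
    # categories never seen in the reference frame are dropped.
    return encoded.reindex(columns=reference_columns, fill_value=0)

# A new row in the original (un-encoded) layout
new_row = pd.DataFrame({
    "browser_type": ["safari"],
    "version_type": [2],
    "user_age_days": [42],
})

encoded_row = get_dummy_encoding(new_row, dummy_columns, processed_dummy_data.columns)
print(encoded_row)
```

Because the result is reindexed to processed_dummy_data.columns, it has the same columns in the same order however many dummy columns there are. Note that a category that never appeared in the original data is silently dropped, so the row ends up with all zeros for that column group.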
