
How can I use pandas get_dummies() for new data?

I am trying to build a sentiment classifier with Keras and evaluate it on different datasets. My problem is that when I evaluate it on a different dataset, the get_dummies values come out different.

I have 5 sentiments: hate, happiness, anger, neutral and sadness, encoded as [1 0 0 0 0], [0 1 0 0 0], [0 0 1 0 0], [0 0 0 1 0] and [0 0 0 0 1] respectively.

When I try to predict on another dataset, hate is encoded as, for example, [0 0 1 0 0] instead of [1 0 0 0 0]. As a result, val_acc and val_loss are meaningless and very bad.

Is there a way to reindex the get_dummies output? I can't figure out how to do this.

I use the method like this:

tweets = pd.read_csv('data/text_emotion.csv', usecols=[0, 1, 3], names=['id', 'sentiment', 'text'], header=0, encoding="latin-1")
...
y = pd.get_dummies(tweets['sentiment']).values

Thank you in advance!

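This happens because pd.get_dummies builds its columns from the distinct values it finds in the data you pass it (sorted, for plain strings), so every call produces its own encoding. Nothing carries the mapping over from the training set to a new dataset: if the new data lacks a label or contains an extra one, the column positions shift. Here's a quick example with two datasets: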

# the sequences of values are different

df = pd.DataFrame({'coun':['a','b','c']*2}) # imagine as train
df1 = pd.DataFrame({'coun':['c','b','a']*2}) # imagine as validation

pd.get_dummies(df)

   coun_a  coun_b  coun_c
0       1       0       0
1       0       1       0
2       0       0       1
3       1       0       0
4       0       1       0
5       0       0       1

pd.get_dummies(df1)

   coun_a  coun_b  coun_c
0       0       0       1
1       0       1       0
2       1       0       0
3       0       0       1
4       0       1       0
5       1       0       0

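Each call encodes only the frame it is given. The columns line up here only because both frames contain the same three values; drop or add a value in one of them and they no longer match. To get one fixed mapping, fit an encoder on the training data and reuse it everywhere: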

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

# fit on train 
ohe.fit(df['coun'].values.reshape(-1,1))

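# transform train/valid with the same fitted encoder
# (.toarray() returns a plain ndarray; .todense() would return np.matrix)
y_train_label = ohe.transform(df['coun'].values.reshape(-1,1)).toarray()
y_valid_label = ohe.transform(df1['coun'].values.reshape(-1,1)).toarray()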
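
If you would rather stay with pandas, as the question asks, one option is to pin the column order with an explicit category list before calling get_dummies. The following is a minimal sketch, not the only way to do it. The SENTIMENTS list, the encode_labels helper and the two small Series are hypothetical names made up for illustration; the label order is the one given in the question.

```python
import pandas as pd

# Assumed label order, taken from the question; adjust it to your data.
SENTIMENTS = ['hate', 'happiness', 'anger', 'neutral', 'sadness']

def encode_labels(labels, categories=SENTIMENTS):
    """One-hot encode labels with a fixed column order (hypothetical helper)."""
    # A Categorical with explicit categories makes get_dummies emit one
    # column per category, in the given order, even if a label is absent.
    cat = pd.Series(pd.Categorical(labels, categories=categories))
    # astype(int) gives 0/1 regardless of the pandas version's default dtype.
    return pd.get_dummies(cat).astype(int)

train = pd.Series(['hate', 'happiness', 'anger', 'neutral', 'sadness'])
new = pd.Series(['sadness', 'hate'])  # only two of the five labels occur

y_train = encode_labels(train).values
y_new = encode_labels(new).values
```

Both arrays have five columns in the same order, so hate maps to the first column in each. A label that is not in the category list becomes an all-zero row, so check for unexpected labels before training.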

Depending on the problem, you can do:

A) Sort your data, if there is only one categorical column you want to one-hot encode.

B) Concatenate all the dataframes, if that is all the data you will ever have to handle (for example, in a submission-based competition), call get_dummies() once on the combined frame, and then split it again.

C) As suggested by YOLO, use sklearn's OneHotEncoder.
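
Option B can be sketched as follows. This is an illustrative example under assumptions, not code from the original answer: the two small frames and the 'train' / 'valid' keys are made up, and the validation frame deliberately lacks the value 'b'.

```python
import pandas as pd

train = pd.DataFrame({'coun': ['a', 'b', 'c'] * 2})
valid = pd.DataFrame({'coun': ['c', 'a']})  # 'b' never occurs here

# Stack both frames; `keys` adds an outer index level recording the origin.
combined = pd.concat([train, valid], keys=['train', 'valid'])

# Encode once so both parts share the same columns.
dummies = pd.get_dummies(combined, columns=['coun']).astype(int)

# Split back using the outer index level.
y_train = dummies.loc['train']
y_valid = dummies.loc['valid']
```

y_valid still has a coun_b column (all zeros), so its layout matches y_train. This only works when every dataset is available up front; for data that arrives later, a fitted encoder or a fixed category list is the safer choice.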
