What is the best way to encode string values using PyTorch?
df_train.head():
   country  league            home_team        away_team   home_odds  draw_odds  away_odds  home_score  away_score  dow  month
0  Brazil   Copa do Nordeste  Sport Recife     Imperatriz  1.36       4.31       7.66       2           2           4    2
1  Brazil   Copa do Nordeste  ABC              America RN  2.62       3.30       2.48       2           1           6    2
2  Brazil   Copa do Nordeste  Frei Paulistano  Nautico     5.19       3.58       1.62       0           2           6    2
3  Brazil   Copa do Nordeste  Botafogo PB      Confianca   2.06       3.16       3.50       1           1           6    2
4  Brazil   Copa do Nordeste  Fortaleza        Ceara       2.19       2.98       3.38       1           1           6    2
df_train.shape:
(76544, 11)
df_test.head():
   country    league          home_team      away_team      home_odds  draw_odds  away_odds  home_score  away_score  dow  month
0  World      Club Friendly   Westerlo       Gent           2.93       3.47       2.19       NaN         NaN         4    6
1  Malaysia   Super League    Johor DT       Selangor       1.27       5.59       8.26       NaN         NaN         4    6
2  Argentina  Reserve League  Lanus 2        River Plate 2  2.54       3.12       2.65       NaN         NaN         4    6
3  Asia       AFC Cup         Bali United    Kedah          1.58       4.08       4.93       NaN         NaN         4    6
4  Ethiopia   Premier League  Defence Force  Adama City     2.93       2.16       3.38       NaN         NaN         4    6
df_test.shape:
(599, 11)
Outside PyTorch, I do the encoding with sklearn and pandas like this:
import pandas as pd
from sklearn import preprocessing

def encode_features(df_train, df_test):
    # Columns that hold string categories
    features = ['country', 'league', 'home_team', 'away_team']
    # Fit on train and test together so both share one mapping
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

df_train, df_test = encode_features(df_train, df_test)
To encode those four string columns, you can use a label encoder or a one-hot encoder. Here is a reference class for your case using a label encoder.
import pandas
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # list of column names to encode

    def fit(self, X, y=None):
        return self  # nothing is learned here; encoders are fit inside transform

    def transform(self, X):
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # DataFrame.iteritems() was removed in pandas 2.0; items() replaces it
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

MultiColumnLabelEncoder(columns=['country', 'league', 'home_team', 'away_team']).fit_transform(df_train)
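One caveat with the class above: `transform` fits a fresh `LabelEncoder` on every call, so encoding `df_train` and `df_test` in separate calls can assign different integers to the same team. A minimal sketch of the alternative, learning the mapping once and reusing it, is shown below in plain Python. The helper names `build_vocab` and `encode`, and the choice of reserving index 0 for unseen values, are illustrative assumptions and not part of sklearn or PyTorch.

```python
def build_vocab(values):
    """Map each distinct string to an integer, leaving 0 free for unseen values."""
    # Sort so the mapping is deterministic regardless of row order.
    return {value: index for index, value in enumerate(sorted(set(values)), start=1)}


def encode(values, vocab, unknown_index=0):
    """Encode strings with an existing vocabulary; unseen strings get unknown_index."""
    return [vocab.get(value, unknown_index) for value in values]


# Toy data: the home_team values from the training rows shown in the question,
# plus a small made-up test list that mixes seen and unseen teams.
train_home = ["Sport Recife", "ABC", "Frei Paulistano", "Botafogo PB", "Fortaleza"]
test_home = ["Westerlo", "ABC", "Fortaleza"]

home_vocab = build_vocab(train_home)        # learned from the training rows only
train_ids = encode(train_home, home_vocab)
test_ids = encode(test_home, home_vocab)    # same mapping reused for the test rows
```

With this approach a team that appears only in the test set ("Westerlo" here) falls back to the reserved index instead of raising an error or shifting the other codes.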
I hope this helps in your case.
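Since the question asks about PyTorch specifically: once the string columns are integer codes, a common next step is to pass them through `torch.nn.Embedding`, which takes integer indices and returns a learned vector per category. The sketch below assumes the codes have already been computed; the code values, the number of teams, and the embedding size of 4 are made-up numbers for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical integer codes for three rows of two encoded columns
# (for example home_team and away_team after label encoding).
home_ids = torch.tensor([5, 1, 4], dtype=torch.long)
away_ids = torch.tensor([2, 0, 3], dtype=torch.long)

# num_embeddings must be larger than the biggest code that can occur.
num_teams = 6        # assumed number of distinct team codes
embedding_dim = 4    # assumed size of each learned vector

team_embedding = nn.Embedding(num_embeddings=num_teams, embedding_dim=embedding_dim)

# Look up one vector per row for each column, then concatenate them so they
# can be fed to later layers alongside numeric features such as the odds.
home_vec = team_embedding(home_ids)                  # shape: (3, 4)
away_vec = team_embedding(away_ids)                  # shape: (3, 4)
features = torch.cat([home_vec, away_vec], dim=1)    # shape: (3, 8)
```

Sharing one embedding table between the home and away columns is a design choice made here for brevity; separate tables per column would work the same way.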