How do I train test split in PyTorch?

What is the best way to encode string values using PyTorch?

df_train.head():
  country            league        home_team   away_team  home_odds  draw_odds  away_odds  home_score  away_score  dow  month
0  Brazil  Copa do Nordeste     Sport Recife  Imperatriz       1.36       4.31       7.66           2           2    4      2
1  Brazil  Copa do Nordeste              ABC  America RN       2.62       3.30       2.48           2           1    6      2
2  Brazil  Copa do Nordeste  Frei Paulistano     Nautico       5.19       3.58       1.62           0           2    6      2
3  Brazil  Copa do Nordeste      Botafogo PB   Confianca       2.06       3.16       3.50           1           1    6      2
4  Brazil  Copa do Nordeste        Fortaleza       Ceara       2.19       2.98       3.38           1           1    6      2

df_train.shape:
(76544, 11)

df_test.head():
     country          league      home_team      away_team  home_odds  draw_odds  away_odds  home_score  away_score  dow  month
0      World   Club Friendly       Westerlo           Gent       2.93       3.47       2.19         NaN         NaN    4      6
1   Malaysia    Super League       Johor DT       Selangor       1.27       5.59       8.26         NaN         NaN    4      6
2  Argentina  Reserve League        Lanus 2  River Plate 2       2.54       3.12       2.65         NaN         NaN    4      6
3       Asia         AFC Cup    Bali United          Kedah       1.58       4.08       4.93         NaN         NaN    4      6
4   Ethiopia  Premier League  Defence Force     Adama City       2.93       2.16       3.38         NaN         NaN    4      6

df_test.shape:
(599, 11)

I currently do the encoding with pandas and sklearn as follows:

import pandas as pd
from sklearn import preprocessing


def encode_features(df_train, df_test):
    # String-valued columns to convert to integer labels
    features = ['country', 'league', 'home_team', 'away_team']
    # Fit on train and test together so both share one label mapping
    df_combined = pd.concat([df_train[features], df_test[features]])

    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test


df_train, df_test = encode_features(df_train, df_test)

To encode those four string columns, you can use either a label encoder or a one-hot encoder. Here is a reference class for your case that uses a label encoder.

import pandas
from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # list of column names to encode; None means all columns

    def fit(self, X, y=None):
        return self  # nothing is learned here; encoders are fitted inside transform

    def transform(self, X):
        # Note: a fresh LabelEncoder is fitted on every call, so separate calls
        # on df_train and df_test will not share the same label mapping.
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            # DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

MultiColumnLabelEncoder(columns = ['country', 'league', 'home_team', 'away_team']).fit_transform(df_train)

I hope this helps in your case.
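If the encoded columns are meant to feed a PyTorch model, one common pattern is to build the string-to-index mapping yourself, fit it on the training data only, and reserve an index for values that appear only in the test set. The following is a minimal sketch of that idea using plain Python. The helper names `fit_vocab` and `encode`, the `UNKNOWN` index, and the short sample lists are assumptions made for illustration; they are not part of the question's data or of any library.

```python
UNKNOWN = 0  # assumption: index 0 is reserved for values not seen when fitting


def fit_vocab(values):
    """Assign each distinct value an integer index, starting at 1."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1
    return vocab


def encode(values, vocab):
    """Look up each value; unseen values fall back to UNKNOWN."""
    return [vocab.get(value, UNKNOWN) for value in values]


# Illustrative stand-ins for df_train['country'] and df_test['country']
train_countries = ["Brazil", "Brazil", "World", "Malaysia"]
test_countries = ["Malaysia", "Ethiopia", "Brazil"]

country_vocab = fit_vocab(train_countries)          # fitted on training data only
train_ids = encode(train_countries, country_vocab)
test_ids = encode(test_countries, country_vocab)    # "Ethiopia" maps to UNKNOWN
```

The resulting integer lists could then be wrapped with `torch.tensor(train_ids, dtype=torch.long)` and passed to a `torch.nn.Embedding` layer sized `len(country_vocab) + 1` to leave room for the unknown index. Unlike fitting on the concatenated train and test frames, this keeps the mapping independent of the test set.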
