I have a data set with some string columns, and I am developing a neural network on it. Because the data set contains string values, I can't train the network on it directly. What is the best way to convert these string values into a format a neural network can read?
This is the data set that I have:
type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,1,0
PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,0,1
I want to convert the type, nameOrig and nameDest fields into a format a neural network can read.
I have used the method below, but I don't know whether it is right or wrong.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
test_set = pd.read_csv('cs.csv')
# fit() returns the encoder itself; fit_transform() returns the integer codes
test_set['type'] = enc.fit_transform(test_set['type'])
I have gone through the questions below, but most of them did not work for me:
How to convert string based data frame to numeric
converting non-numeric to numeric value using Panda libraries
You need to encode the string values as numeric ones. What I usually do in this case is create a lookup table for each non-numeric feature, containing all the possible values of that feature. The index of a value in its feature's table is then used when training the model.
Example:
type_values = ['PAYMENT', 'TRANSFER']
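The lookup-table idea above can be sketched as follows. This is only a minimal illustration on the three sample rows from the question; the names `type_to_index` and `type_index` are made up for the example, and sorting the values is an assumption that simply keeps the table order reproducible.

```python
import pandas as pd

# The 'type' column of the three sample rows from the question.
df = pd.DataFrame({"type": ["PAYMENT", "PAYMENT", "TRANSFER"]})

# Lookup table: every distinct value of the feature, in a fixed (sorted) order.
type_values = sorted(df["type"].unique())

# Map each value to its position in the table (hypothetical helper names).
type_to_index = {value: index for index, value in enumerate(type_values)}
df["type_index"] = df["type"].map(type_to_index)

print(type_values)                 # ['PAYMENT', 'TRANSFER']
print(df["type_index"].tolist())   # [0, 0, 1]
```

The same `type_to_index` dictionary has to be reused for any later data, otherwise the indices will not mean the same thing as they did during training.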
First you need to transform the three columns using the LabelEncoder class.
Here type is a categorical value, so for it you can use the OneHotEncoder class available in sklearn.preprocessing.
Then you need to avoid the dummy variable trap by removing one of the columns used to represent type, since its value can always be inferred from the remaining ones.
Here is some sample code for your reference.
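import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
dataset = pd.read_csv('cs.csv')
X = dataset.iloc[:, :].values
# Label-encode type, nameOrig and nameDest (columns 0, 2 and 5)
labelencoder = LabelEncoder()
X[:, 0] = labelencoder.fit_transform(X[:, 0])
X[:, 2] = labelencoder.fit_transform(X[:, 2])
X[:, 5] = labelencoder.fit_transform(X[:, 5])
# One-hot encode column 0 (type). OneHotEncoder(categorical_features=[0]) was
# removed in scikit-learn 0.22, so ColumnTransformer selects the column instead.
ct = ColumnTransformer([('type', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)
# Avoiding the Dummy Variable Trap
X = X[:, 1:]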
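For comparison, the dummy variable trap can also be avoided in pandas alone. The sketch below is an alternative to the scikit-learn code above, not a replacement for it: it uses `pd.get_dummies` with `drop_first=True` on the `type` and `amount` values of the three sample rows, and the variable name `encoded` is chosen only for illustration.

```python
import pandas as pd

# 'type' and 'amount' from the three sample rows in the question.
df = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER"],
    "amount": [9839.64, 1864.28, 181.0],
})

# One-hot encode 'type' and drop the first dummy column (type_PAYMENT),
# which is what removing one column to avoid the trap amounts to.
encoded = pd.get_dummies(df, columns=["type"], drop_first=True, dtype=float)

print(encoded.columns.tolist())           # ['amount', 'type_TRANSFER']
print(encoded["type_TRANSFER"].tolist())  # [0.0, 0.0, 1.0]
```

With two categories a single 0/1 column remains; with k categories, k - 1 columns would remain.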
In this case you can use the pandas category dtype to map strings to indices (see categorical data), so it is not necessary to use scikit-learn's LabelEncoder or OneHotEncoder.
import pandas as pd
df = pd.read_csv('54055554.csv', header=0, dtype={
    'type': 'category',  # <--
    'amount': float,
    'nameOrig': str,
    'oldbalanceOrg': float,
    'newbalanceOrig': float,
    'nameDest': str,
    'oldbalanceDest': float,
    'newbalanceDest': float,
    'isFraud': bool,
    'isFlaggedFraud': bool
})
print(dict(enumerate(df['type'].cat.categories)))
# {0: 'PAYMENT', 1: 'TRANSFER'}
print(list(df['type'].cat.codes))
# [0, 0, 1]
The data from the CSV:
type, ...
PAYMENT, ...
PAYMENT, ...
TRANSFER, ...
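One practical point with category codes is that the mapping depends on which values pandas sees, so separately loaded data may get different codes. A possible way to keep them consistent, sketched here under assumptions, is to fix the category list with `pd.CategoricalDtype` and apply it to every frame. The frames `train` and `new` and the value `CASH_OUT` are hypothetical and not taken from the question's data.

```python
import pandas as pd

# Hypothetical frames: 'train' mirrors the sample rows, 'new' is later data.
train = pd.DataFrame({"type": ["PAYMENT", "PAYMENT", "TRANSFER"]})
new = pd.DataFrame({"type": ["TRANSFER", "PAYMENT", "CASH_OUT"]})

# Fix the category order once and reuse it for every frame.
type_dtype = pd.CategoricalDtype(categories=["PAYMENT", "TRANSFER"])

train_codes = train["type"].astype(type_dtype).cat.codes
new_codes = new["type"].astype(type_dtype).cat.codes

print(train_codes.tolist())  # [0, 0, 1]
print(new_codes.tolist())    # [1, 0, -1]  (-1 marks a value outside the fixed categories)
```

Values outside the fixed categories become missing and get the code -1, so they need separate handling before being fed to a network.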