
How to convert pandas data frame string values to numeric values

I have a data set with some string columns, and I am developing a neural network using it. Because the data set contains string values, I can't train the network. What is the best way to convert these string values to a format a neural network can read?

This is the data set that I have:

type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,1,0
PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,0,1

I want to convert the type, nameOrig and nameDest fields to a format a neural network can read.

I have used the method below, but I don't know whether it's right or wrong.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

test_set = pd.read_csv('cs.csv')
new_test_set['type'] = enc.fit(new_test_set['type'])

I have gone through the questions below, but most of them did not work for me:

How to convert string based data frame to numeric

converting non-numeric to numeric value using Panda libraries


You need to encode the string values as numeric ones. What I usually do in this case is create a table for each non-numeric feature, containing all possible values of that feature. The index of a value in its feature's table is then used when training the model.

Example:

type_values = ['PAYMENT', 'TRANSFER']
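The lookup-table idea can be sketched as follows. This is a minimal illustration, not the answerer's exact code: the names df, type_to_index and type_index are hypothetical, and the three rows are copied from the sample data in the question.

```python
import pandas as pd

# Table of all possible values of the 'type' feature (from the example above)
type_values = ['PAYMENT', 'TRANSFER']

# Hypothetical frame holding only the 'type' column of the sample rows
df = pd.DataFrame({'type': ['PAYMENT', 'PAYMENT', 'TRANSFER']})

# Map each string to its index in the table
type_to_index = {value: index for index, value in enumerate(type_values)}
df['type_index'] = df['type'].map(type_to_index)

print(df['type_index'].tolist())
```

Any value that is missing from type_values would map to NaN, so the table has to cover every value that appears in the column.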

Transformation

First, you need to transform the three string columns using the LabelEncoder class.

Encoding Categorical Data

Here, type is a categorical value. For this you can use the OneHotEncoder class available in sklearn.preprocessing.

Avoiding Dummy Variable Trap

Then you need to avoid the Dummy Variable Trap by removing any one of the columns used to represent type. The one-hot columns always sum to 1, so one of them is redundant.

Code

Here is some sample code for your reference.

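import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

dataset = pd.read_csv('cs.csv')
X = dataset.iloc[:].values

# Label-encode the three string columns (type, nameOrig, nameDest).
# A fresh encoder per column keeps each mapping separate.
for col in (0, 2, 5):
    X[:, col] = LabelEncoder().fit_transform(X[:, col])

# One-hot encode column 0 (type). The categorical_features argument was
# removed from OneHotEncoder in scikit-learn 0.22, so ColumnTransformer
# selects the column instead. The one-hot columns come first in the output.
ct = ColumnTransformer([('type', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)

# Avoiding the Dummy Variable Trap: drop the first one-hot column
X = X[:, 1:]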
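The one-hot and Dummy Variable Trap steps can also be sketched with pandas alone. This is an alternative to the scikit-learn code above, not part of the original answer: it uses pd.get_dummies with drop_first=True, and the small frame df is a hypothetical stand-in for two columns of the sample data.

```python
import pandas as pd

# Hypothetical stand-in for the 'type' and 'amount' columns of the sample rows
df = pd.DataFrame({
    'type': ['PAYMENT', 'PAYMENT', 'TRANSFER'],
    'amount': [9839.64, 1864.28, 181.0],
})

# One-hot encode 'type'; drop_first=True removes the first dummy column,
# which is the same idea as X = X[:, 1:] in the code above
encoded = pd.get_dummies(df, columns=['type'], drop_first=True)

print(encoded.columns.tolist())
```

With only two categories, a single indicator column (type_TRANSFER) remains after the first one is dropped.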

In this case you can use the pandas category dtype to map strings to indices (see the pandas documentation on categorical data), so it's not necessary to use LabelEncoder or OneHotEncoder from scikit-learn.

import pandas as pd

df = pd.read_csv('54055554.csv', header=0, dtype={
    'type': 'category',  # <--
    'amount': float,
    'nameOrig': str,
    'oldbalanceOrg': float,
    'newbalanceOrig': float,
    'nameDest': str,
    'oldbalanceDest': float,
    'newbalanceDest': float,
    'isFraud': bool,
    'isFlaggedFraud': bool
})

print(dict(enumerate(df['type'].cat.categories)))
# {0: 'PAYMENT', 1: 'TRANSFER'}

print(list(df['type'].cat.codes))
# [0, 0, 1]

The data from the CSV:

type, ...
PAYMENT, ...
PAYMENT, ...
TRANSFER, ...
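To pass the result to a model, each string column can be replaced by its integer codes. The following is a minimal sketch under stated assumptions: the three sample rows from the question are read from an in-memory string rather than a file, and the names csv_text and string_columns are hypothetical.

```python
import io
import pandas as pd

# The three sample rows from the question, held in memory for illustration
csv_text = """type,amount,nameOrig,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud
PAYMENT,9839.64,C1231006815,170136.0,160296.36,M1979787155,0.0,0.0,1,0
PAYMENT,1864.28,C1666544295,21249.0,19384.72,M2044282225,0.0,0.0,0,0
TRANSFER,181.0,C1305486145,181.0,0.0,C553264065,0.0,0.0,0,1
"""
df = pd.read_csv(io.StringIO(csv_text))

# Replace each string column with the integer codes of its categories
string_columns = ['type', 'nameOrig', 'nameDest']
for column in string_columns:
    df[column] = df[column].astype('category').cat.codes

print(df.dtypes)
```

Categories are sorted before codes are assigned, so the codes reflect alphabetical order and carry no other meaning. Whether integer codes are a sensible input for identifier-like columns such as nameOrig and nameDest depends on the model.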
