Convert heterogeneous pandas.DataFrame to homogeneous one

Question

I want to analyse heterogeneous data, organised as observations / variables, contained in a pandas.DataFrame like this:

   Age   Name     Ok  Result
0   25    Bob   True     1.2
1   41   John  False     0.5
2   30  Alice   True     0.3

For that, I usually convert it to its NumPy representation using pandas.DataFrame.values, thus obtaining:

[[25 'Bob'   True  1.2]
 [41 'John'  False 0.5]
 [30 'Alice' True  0.3]]

which has only the object dtype, if I understand the documentation correctly:

A DataFrame with mixed type columns (e.g., str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types (e.g., object).
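A quick way to check this behaviour is to compare the dtype of the full conversion with that of the numeric columns only. This is a small sketch that rebuilds the example frame; the variable names mixed and numeric are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Bob', 'John', 'Alice'],
    'Age': [25, 41, 30],
    'Result': [1.2, 0.5, 0.3],
    'Ok': [True, False, True],
})

# All four columns: strings force the broadest common type.
mixed = df.to_numpy()

# Integer and float columns only: a common numeric type exists.
numeric = df[['Age', 'Result']].to_numpy()

print(mixed.dtype)    # object
print(numeric.dtype)  # float64
```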

Question: How do I convert a pandas.DataFrame (or a numpy.ndarray) of heterogeneous type to one with a homogeneous numeric type, like this:

[[25.0  1.0  1.0  1.2]
 [41.0  2.0  0.0  0.5]
 [30.0  3.0  1.0  0.3]]

where there is a correspondence between 'Bob' and 1.0, 'John' and 2.0, ..., True and 1.0, ...

I ask this because I want to perform a sklearn.decomposition.PCA on all the data, which raises an error when dealing with string values.

Here is a minimal (not) working example:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

d  = {'Name': ['Bob', 'John', 'Alice'], 'Age': [25, 41, 30], 'Result' : [1.2, 0.5, 0.3], 'Ok' : [True, False, True]}
df = pd.DataFrame(data=d)

df.info()
print(df)

data = df.values

print(data)

# n_components=None keeps all components (the builtin `all` is not a valid value)
pca = PCA(n_components=None)
pca.fit(data)

Answer 1

First of all, if this is a sample of the original data, then by the very concept of PCA there is no way you can get a good result from it. The main use case for PCA is high-dimensional multivariate data. So by plugging in Bob, John, Alice as 1, 2, 3 you are not going to get any good results, as they are unique IDs, not repeated observations from the same class. But if it is just for learning purposes, you can transform the data as follows:

import pandas as pd

d  = {'Name': ['Bob', 'John', 'Alice'], 
      'Age': [25, 41, 30], 
      'Result' : [1.2, 0.5, 0.3], 
      'Ok' : [True, False, True]
      }

df = pd.DataFrame(data=d)

# convert True/False to 1/0
df['Ok'] = df.Ok.astype(int)

# collect the unique names in a list
name_list = list(df.Name.unique())
# map each name to an integer code (starting at 1, as in the question)
name_map = {name: idx for idx, name in enumerate(name_list, start=1)}

# apply the map
df['Name'] = df['Name'].map(name_map)

# all columns are numeric now, so this is a homogeneous float array
data = df.values
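As an alternative to building the mapping by hand, pandas.factorize assigns an integer code to each distinct value in order of first appearance. The following is only a sketch under that assumption: the helper name to_numeric_array is hypothetical (it is not part of pandas), and the + 1 is there solely to reproduce the 1.0, 2.0, 3.0 numbering asked for in the question, since factorize itself starts at 0.

```python
import numpy as np
import pandas as pd


def to_numeric_array(df):
    """Return a float64 array; non-numeric columns become integer codes.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_bool_dtype(out[col]):
            # True/False -> 1/0
            out[col] = out[col].astype(int)
        elif not pd.api.types.is_numeric_dtype(out[col]):
            # Codes follow the order of first appearance, starting at 0;
            # add 1 so that the first label maps to 1.
            codes, _ = pd.factorize(out[col])
            out[col] = codes + 1
    return out.to_numpy(dtype=float)


df = pd.DataFrame({
    'Name': ['Bob', 'John', 'Alice'],
    'Age': [25, 41, 30],
    'Result': [1.2, 0.5, 0.3],
    'Ok': [True, False, True],
})

data = to_numeric_array(df)
print(data.dtype)  # float64
print(data)
```

The same caveat as above applies: the codes are arbitrary labels, so any ordering or distance that PCA reads into them has no meaning for identifiers such as names.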
