
What is the fastest way to convert categorical data with multiple features to numeric in Python?

As an example, I have a mushroom data set with tens of categorical features. I want to load it into a pandas.DataFrame and convert it to numeric. The features are stored in columns and each row represents one sample, so the conversion to numeric should be applied column by column. In R, I would need only two lines of code for that:

#Load the data. The features are categorical.
mushrooms <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header = FALSE, stringsAsFactors = TRUE)

#Convert the features to numeric. The features are stored in columns.
mushroomsNumeric <- data.frame(lapply(mushrooms, as.numeric))

# View the first 5 samples of the original data.
mushrooms[1:5,]
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1  p  x  s  n  t  p  f  c  n   k   e   e   s   s   w   w   p   w   o   p   k   s   u
2  e  x  s  y  t  a  f  c  b   k   e   c   s   s   w   w   p   w   o   p   n   n   g
3  e  b  s  w  t  l  f  c  b   n   e   c   s   s   w   w   p   w   o   p   n   n   m
4  p  x  y  w  t  p  f  c  n   n   e   e   s   s   w   w   p   w   o   p   k   s   u
5  e  x  s  g  f  n  f  w  b   k   t   e   s   s   w   w   p   w   o   e   n   a   g

# View the first 5 samples of the converted data.  
mushroomsNumeric[1:5,]
  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23
1  2  6  3  5  2  7  2  1  2   5   1   4   3   3   8   8   1   3   2   5   3   4   6
2  1  6  3 10  2  1  2  1  1   5   1   3   3   3   8   8   1   3   2   5   4   3   2
3  1  1  3  9  2  4  2  1  1   6   1   3   3   3   8   8   1   3   2   5   4   3   4
4  2  6  4  9  2  7  2  1  2   6   1   4   3   3   8   8   1   3   2   5   3   4   6
5  1  6  3  4  1  6  2  2  1   5   2   4   3   3   8   8   1   3   2   1   4   1   2

What would be the fastest way to do the same in Python with pandas.DataFrame? Thanks!

You can use LabelEncoder from the scikit-learn library.

import pandas as pd
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()

# sample data
df = pd.DataFrame({'V1': ['a', 'b', 'a', 'd'],
                   'V2': ['c', 'd', 'd', 'c']})

# refit the encoder on each column and return its integer codes
df.apply(lbl.fit_transform)

   V1   V2
0   0   0
1   1   1
2   0   1
3   2   0
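The same per-column codes can also be obtained with pandas alone, through the categorical dtype, which is close in spirit to R factors. The following is only a minimal sketch on the sample frame above; the names as_category, codes and v1_labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'V1': ['a', 'b', 'a', 'd'],
                   'V2': ['c', 'd', 'd', 'c']})

# Convert every column to the categorical dtype; categories are sorted per column.
as_category = df.astype('category')

# .cat.codes is the integer position of each value within its column's categories.
codes = as_category.apply(lambda col: col.cat.codes)

# The categories are kept, so a code can be mapped back to its label later.
v1_labels = list(as_category['V1'].cat.categories)

print(codes)
print(v1_labels)
```

Keeping the categorical columns around is what makes it possible to translate codes back to labels, which df.apply(lbl.fit_transform) does not allow because the single encoder is refit on every column.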

Use pd.factorize

def f(x):
    return pd.factorize(x)[0]

For factorizing columns

df.apply(f)

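For factorizing rows (on pandas 0.23 and later, pass result_type='broadcast' to get a DataFrame with the original labels back; otherwise a list-like result with axis=1 is returned as a Series of arrays)

df.apply(f, axis=1, result_type='broadcast')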

For factorizing entire dataframe together

pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)
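It may help to see what pd.factorize actually returns. By default the codes follow the order of first appearance, while sort=True orders the labels first, which is how R assigns factor levels (R codes start at 1 rather than 0). A small sketch, using the first five values of column V1 from the question as sample data; the variable names are illustrative only.

```python
import pandas as pd

# First five values of V1 from the question's mushroom sample.
s = pd.Series(['p', 'e', 'e', 'p', 'e'])

# Default: codes follow order of first appearance (p -> 0, e -> 1).
codes, uniques = pd.factorize(s)

# sort=True: labels are sorted before codes are assigned (e -> 0, p -> 1).
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)

# Adding 1 gives 1-based codes comparable to as.numeric() on an R factor.
r_like = codes_sorted + 1

print(list(codes), list(uniques))
print(list(codes_sorted), list(uniques_sorted))
print(list(r_like))
```

The second element of the returned tuple holds the labels, so uniques[codes] recovers the original values.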

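Here is a summary of the two solutions from the previous answers, as they look when applied to my data. Note that solution 1 encodes each column separately, whereas solution 2 factorizes the whole DataFrame at once, so its codes are shared across columns.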

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load the data with categorical features.
mushrooms = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data", header=None)

# Convert the categorical features to numeric: solution 1.
labelEncoder = LabelEncoder()
mushroomsNumeric = mushrooms.apply(labelEncoder.fit_transform)

# Convert the categorical features to numeric: solution 2.
mushroomsNumeric2 = pd.DataFrame(
    pd.factorize(mushrooms.values.ravel())[0].reshape(mushrooms.shape),
    mushrooms.index, mushrooms.columns)

mushroomsNumeric.head(5)
Out[35]: 
   0   1   2   3   4   5   6   7   8   9  ...  13  14  15  16  17  18  19  20  \
0   1   5   2   4   1   6   1   0   1   4 ...   2   7   7   0   2   1   4   2   
1   0   5   2   9   1   0   1   0   0   4 ...   2   7   7   0   2   1   4   3   
2   0   0   2   8   1   3   1   0   0   5 ...   2   7   7   0   2   1   4   3   
3   1   5   3   8   1   6   1   0   1   5 ...   2   7   7   0   2   1   4   2   
4   0   5   2   3   0   5   1   1   0   4 ...   2   7   7   0   2   1   0   3   

   21  22  
0   3   5  
1   2   1  
2   2   3  
3   3   5  
4   0   1  

[5 rows x 23 columns]

mushroomsNumeric2.head(5)
Out[36]: 
   0   1   2   3   4   5   6   7   8   9  ...  13  14  15  16  17  18  19  20  \
0   0   1   2   3   4   0   5   6   3   7 ...   2   9   9   0   9  10   0   7   
1   8   1   2  12   4  13   5   6  14   7 ...   2   9   9   0   9  10   0   3   
2   8  14   2   9   4  16   5   6  14   3 ...   2   9   9   0   9  10   0   3   
3   0   1  12   9   4   0   5   6   3   3 ...   2   9   9   0   9  10   0   7   
4   8   1   2  15   5   3   5   9  14   7 ...   2   9   9   0   9  10   8   3   

   21  22  
0   2  11  
1   3  15  
2   3  17  
3   2  11  
4  13  15  

[5 rows x 23 columns]
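The two outputs differ because the first encodes every column on its own, while the second assigns one code per label across the whole frame. The sketch below illustrates this on three columns of the five sample rows shown in the question, so it runs without downloading the data. The column names V1, V2, V4 follow the R output, and per-column pd.factorize with sort=True stands in for LabelEncoder here as an assumption, since both sort the labels of a column before numbering them.

```python
import pandas as pd

# Three columns of the first five mushroom samples from the question.
sample = pd.DataFrame({'V1': ['p', 'e', 'e', 'p', 'e'],
                       'V2': ['x', 'x', 'b', 'x', 'x'],
                       'V4': ['n', 'y', 'w', 'w', 'g']})

# Per-column encoding: each column gets its own codes, starting at 0.
per_column = sample.apply(lambda col: pd.factorize(col, sort=True)[0])

# Whole-frame encoding: one code per label, in order of first appearance
# when reading the values row by row.
flat_codes = pd.factorize(sample.to_numpy().ravel())[0]
together = pd.DataFrame(flat_codes.reshape(sample.shape),
                        index=sample.index, columns=sample.columns)

print(per_column)
print(together)
```

With whole-frame encoding the same letter gets the same number in every column, which is rarely wanted when each column is a separate feature with its own meaning for that letter.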
