
How to resolve a multi-categorical column in a DataFrame?

I have a DataFrame with columns named 'title' and 'cuisines', where each cell can contain multiple values of a similar category. How can I resolve them and convert them to numerical form? Also, how do I replace NaN values in such columns?

I thought of trying one-hot encoding, but this would unnecessarily increase the number of columns. That said, I would like all the categories to be separated. The cuisines column has 220 unique cuisines and the title column has 24 unique titles.

Example: (image not available)

Well, one could argue that one-hot encoding, i.e. converting categorical columns to numeric ones, does not "unnecessarily" increase the number of columns. In fact, it is a necessity if you really want to pull all the different categories apart into numerical values.
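As a rough sketch of what that looks like when a cell holds several comma-separated values, `Series.str.get_dummies` can split each cell on a separator and produce one indicator column per distinct value. The small frame and the `CUISINE_` prefix here are made up for illustration, and the separator `', '` is an assumption about how the real data is delimited:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real frame would have many more rows.
df = pd.DataFrame({
    'CUISINES': ['Malwani, Goan, North Indian',
                 'Asian, Japanese',
                 np.nan,
                 'Goan'],
})

# Split each cell on ', ' and build one 0/1 indicator column per cuisine.
# Rows where CUISINES is NaN end up with 0 in every indicator column.
cuisine_flags = df['CUISINES'].str.get_dummies(sep=', ').add_prefix('CUISINE_')

print(cuisine_flags.columns.tolist())
print(cuisine_flags)
```

With 220 distinct cuisines this would give 220 indicator columns, which is exactly the trade-off being discussed.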

But if you want to keep the number of columns the same, you could take all the unique values in a column, build a dictionary that assigns each one an integer, and then map those integers back into the column. NaN shows up as just another unique value with this approach, but ultimately you'll have to decide what you want to do with those:

Given:

import pandas as pd
import numpy as np

df = pd.DataFrame([['CASUAL DINING','Malwani, Goan, North Indian'],
                   ['CASUAL DINING,BAR','Malwani, Goan, North Indian'],
                   ['CASUAL DINING','Asian, Modern Indian, Japanese'],
                   ['QUICK BITES','Tibetan, Chinese'],
                   ['CAFE','Bar Food'],
                   ['CASUAL DINING', 'South Indian, North Indian']], columns = ['TITLE','CUISINES']) 

Output:

print (df)
               TITLE                        CUISINES
0      CASUAL DINING     Malwani, Goan, North Indian
1  CASUAL DINING,BAR     Malwani, Goan, North Indian
2      CASUAL DINING  Asian, Modern Indian, Japanese
3        QUICK BITES                Tibetan, Chinese
4               CAFE                        Bar Food
5      CASUAL DINING      South Indian, North Indian

Create a dictionary of the unique values for each column:

# Map each unique TITLE value to an integer code (in order of first appearance)
title_dict = {value: idx for idx, value in enumerate(df['TITLE'].unique())}

# Map each unique CUISINES value to an integer code (in order of first appearance)
cuisines_dict = {value: idx for idx, value in enumerate(df['CUISINES'].unique())}

Output:

print (title_dict)
{'CASUAL DINING': 0, 'CASUAL DINING,BAR': 1, 'QUICK BITES': 2, 'CAFE': 3}

print (cuisines_dict)
{'Malwani, Goan, North Indian': 0, 'Asian, Modern Indian, Japanese': 1, 'Tibetan, Chinese': 2, 'Bar Food': 3, 'South Indian, North Indian': 4}

Then use those to replace the values in the columns:

df['TITLE'] = df['TITLE'].map(title_dict)   
df['CUISINES'] = df['CUISINES'].map(cuisines_dict)    

Output:

print (df)
   TITLE  CUISINES
0      0         0
1      1         0
2      0         1
3      2         2
4      3         3
5      0         4
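If you would rather not build the dictionaries by hand, `pd.factorize` does roughly the same thing in one call, and it makes the NaN question explicit: by default, missing values get the code -1 instead of a category of their own. The sketch below assumes you are happy to fill NaN with a placeholder label first; the `'UNKNOWN'` label, the `*_CODE` column names and the small frame are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing CUISINES value.
df = pd.DataFrame({
    'TITLE': ['CASUAL DINING', 'QUICK BITES', 'CAFE', 'CASUAL DINING'],
    'CUISINES': ['Bar Food', np.nan, 'Tibetan, Chinese', 'Bar Food'],
})

# Default behaviour: NaN is not given its own category, it is coded as -1.
codes_default, uniques_default = pd.factorize(df['CUISINES'])

# Alternative: replace NaN with an explicit placeholder before encoding,
# so missing values become an ordinary category.
filled = df['CUISINES'].fillna('UNKNOWN')
codes_filled, uniques_filled = pd.factorize(filled)

df['CUISINES_CODE'] = codes_filled
df['TITLE_CODE'] = pd.factorize(df['TITLE'])[0]
print(df)
```

Whether -1, a placeholder category, or dropping the rows is the right choice depends on what you plan to do with the encoded columns afterwards.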
