
Convert categorical variables to numeric in R

I have a large data set with many categorical variables. You can see its structure here:

> M=data.frame(Type_peau,PEAU_CORPS,SENSIBILITE,IMPERFECTIONS,BRILLANCE ,GRAIN_PEAU,RIDES_VISAGE,ALLERGIES,MAINS,
+              INTERET_ALIM_NATURELLE,INTERET_ORIGINE_GEO,INTERET_VACANCES,INTERET_COMPOSITION,DataQuest1,Priorite2,
+              Priorite1,DataQuest4,Age,Nbre_gift,w,Nbre_achat)
> # check whether there are missing data
> str(M)
'data.frame':   836 obs. of  21 variables:
 $ Type_peau             : Factor w/ 5 levels "","Grasse","Mixte",..: 3 4 5 3 4 3 3 3 2 3 ...
 $ PEAU_CORPS            : Factor w/ 4 levels "","Normale","Sèche",..: 2 3 3 2 2 2 3 2 3 2 ...
 $ SENSIBILITE           : Factor w/ 4 levels "","Aucune","Fréquente",..: 4 4 4 2 4 3 4 2 4 4 ...
 $ IMPERFECTIONS         : Factor w/ 4 levels "","Fréquente",..: 3 4 3 4 3 2 3 4 3 3 ...
 $ BRILLANCE             : Factor w/ 4 levels "","Aucune","Partout",..: 4 2 2 4 4 4 4 4 3 4 ...
 $ GRAIN_PEAU            : Factor w/ 4 levels "","Dilaté","Fin",..: 4 4 4 2 4 2 4 4 2 4 ...
 $ RIDES_VISAGE          : Factor w/ 4 levels "","Aucune","Très visibles",..: 2 2 2 4 4 2 4 2 4 2 ...
 $ ALLERGIES             : Factor w/ 4 levels "","Non","Oui",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ MAINS                 : Factor w/ 4 levels "","Moites","Normales",..: 3 4 4 3 3 3 3 4 4 4 ...
 $ INTERET_ALIM_NATURELLE: Factor w/ 4 levels "","Beaucoup",..: 2 4 4 4 2 2 2 4 4 2 ...
 $ INTERET_ORIGINE_GEO   : Factor w/ 5 levels "","Beaucoup",..: 2 4 2 5 2 2 2 2 2 2 ...
 $ INTERET_VACANCES      : Factor w/ 6 levels "","À la mer",..: 3 4 2 2 3 2 3 2 3 2 ...
 $ INTERET_COMPOSITION   : Factor w/ 4 levels "","Beaucoup",..: 2 2 2 4 2 2 2 2 4 2 ...
 $ DataQuest1            : Factor w/ 4 levels "-20","20-30",..: 4 3 4 4 4 3 3 2 3 2 ...
 $ Priorite2             : Factor w/ 7 levels "éclatante","hydratée",..: 3 1 3 4 3 2 7 1 4 6 ...
 $ Priorite1             : Factor w/ 7 levels "éclatante","hydratée",..: 4 6 1 5 1 6 1 2 6 4 ...
 $ DataQuest4            : Factor w/ 2 levels "nature","urbain": 2 2 2 2 2 1 2 2 2 2 ...
 $ Age                   : int  32 37 23 44 33 30 43 43 60 31 ...
 $ Nbre_gift             : int  1 4 1 1 2 1 1 1 1 1 ...
 $ w                     : num  0.25 0.25 0.5 0.25 0.5 0 0 0 0 0.75 ...
 $ Nbre_achat            : int  3 4 7 3 6 9 22 13 7 16 ...

I need to convert all categorical variables to numeric automatically. For example, the variable Type_peau currently looks like this:

 head(Type_peau)
[1] Mixte   Normale Sèche   Mixte   Normale Mixte  
Levels:  Grasse Mixte Normale Sèche

I want it to look like this:

head(Type_peau)
[1] 2 3 4 2 3 2
Levels: 1 2 3 4

How can I do that automatically for all categorical variables?

You can use unclass() to get the underlying integer codes of a factor:

Type_peau<-as.factor(c("Mixte","Normale","Sèche","Mixte","Normale","Mixte"))
Type_peau
unclass(Type_peau)
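
The codes that unclass() returns depend on the factor's level set, which is sorted alphabetically by default and, in the question's data, also contains an empty-string level. A minimal sketch, assuming the four skin types should map to 1 through 4 in the order shown in the question, sets the levels explicitly and then extracts the codes with as.integer():

```r
# Values copied from the question's head(Type_peau) output
x <- c("Mixte", "Normale", "Sèche", "Mixte", "Normale", "Mixte")

# Explicit level order (assumed from the question): Grasse = 1, ..., Sèche = 4
skin_levels <- c("Grasse", "Mixte", "Normale", "Sèche")
Type_peau <- factor(x, levels = skin_levels)

# as.integer() returns the bare codes; unclass() would also keep the "levels" attribute
codes <- as.integer(Type_peau)
print(codes)
# [1] 2 3 4 2 3 2
```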

To do so on all categorical variables, you can use sapply():

must_convert <- sapply(M, is.factor)                     # logical vector: TRUE for each factor column
M2 <- sapply(M[, must_convert, drop = FALSE], unclass)   # matrix holding the integer codes of all factor columns
out <- cbind(M[, !must_convert, drop = FALSE], M2)       # complete data.frame with all variables put together
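
Because cbind() puts the non-factor columns first, the column order of out differs from that of M. A sketch that converts the factors in place and keeps the original order, using a small stand-in data frame (toy and its columns are hypothetical, not the asker's data):

```r
# Hypothetical stand-in for M: two factor columns and one integer column
toy <- data.frame(
  skin = factor(c("Mixte", "Normale", "Mixte")),
  age  = c(32L, 37L, 23L),
  area = factor(c("urbain", "nature", "urbain"))
)

# Replace each factor column by its integer codes; leave other columns untouched
is_fac <- vapply(toy, is.factor, logical(1))
toy[is_fac] <- lapply(toy[is_fac], as.integer)

str(toy)
```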

EDIT: A5C1D2H2I1M1N2O1R2T1's solution works in one step:

out<-data.matrix(M)

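It only works as intended if your data.frame doesn't contain any character columns, though: in R versions before 4.0.0 they are turned into NA (newer versions convert them to factors first and then to integer codes).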

Maybe you're after data.matrix. From the function's description:

Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.

Example:

mydf <- data.frame(A = letters[1:5],
                   B = LETTERS[1:5],
                   C = month.abb[1:5],
                   D = 1:5,
                   stringsAsFactors = TRUE)  # needed in R >= 4.0.0 to get factor columns
str(mydf)
# 'data.frame': 5 obs. of  4 variables:
#  $ A: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
#  $ B: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ C: Factor w/ 5 levels "Apr","Feb","Jan",..: 3 2 4 1 5
#  $ D: int  1 2 3 4 5
data.matrix(mydf)
#      A B C D
# [1,] 1 1 3 1
# [2,] 2 2 2 2
# [3,] 3 3 4 3
# [4,] 4 4 1 4
# [5,] 5 5 5 5

Replace it all at once with:

mydf[] <- data.matrix(mydf)
mydf
#   A B C D
# 1 1 1 3 1
# 2 2 2 2 2
# 3 3 3 4 3
# 4 4 4 1 4
# 5 5 5 5 5

Of course, if you have other column types, you first have to decide how you want to deal with them. For instance, one concern raised is that a character column passed to data.matrix ends up as a column of NA values, which is a valid point (this is the behaviour of R versions before 4.0.0; newer versions convert character columns to factors first). However, the real question should be: "How would you like to deal with character columns?"

Here are two options. You can extend the logic similarly for other column types.

mydf <- data.frame(A = letters[1:5],
                   B = LETTERS[1:5],
                   C = month.abb[1:5],
                   D = 1:5,
                   stringsAsFactors = TRUE)  # needed in R >= 4.0.0 to get factor columns
mydf$E <- state.abb[1:5]
str(mydf)
# 'data.frame': 5 obs. of  5 variables:
#  $ A: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
#  $ B: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ C: Factor w/ 5 levels "Apr","Feb","Jan",..: 3 2 4 1 5
#  $ D: int  1 2 3 4 5
#  $ E: chr  "AL" "AK" "AZ" "AR" ...

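## You want to convert everything to numeric
## (stringsAsFactors = TRUE turns the character column into a factor first)
data.matrix(data.frame(unclass(mydf), stringsAsFactors = TRUE))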
#      A B C D E
# [1,] 1 1 3 1 2
# [2,] 2 2 2 2 1
# [3,] 3 3 4 3 4
# [4,] 4 4 1 4 3
# [5,] 5 5 5 5 5

## You only want to convert factors to numeric
mydf[sapply(mydf, is.factor)] <- data.matrix(mydf[sapply(mydf, is.factor)])
mydf
#   A B C D  E
# 1 1 1 3 1 AL
# 2 2 2 2 2 AK
# 3 3 3 4 3 AZ
# 4 4 4 1 4 AR
# 5 5 5 5 5 CA

library(dplyr)

mydf <- data.frame(A = letters[1:5],
                   B = LETTERS[1:5],
                   C = month.abb[1:5],
                   D = 1:5,
                   stringsAsFactors = TRUE)  # needed in R >= 4.0.0 to get factor columns
glimpse(mydf)

# Observations: 5
# Variables: 4
# $ A <fctr> a, b, c, d, e
# $ B <fctr> A, B, C, D, E
# $ C <fctr> Jan, Feb, Mar, Apr, May
# $ D <int> 1, 2, 3, 4, 5

Using predicate functions in dplyr:

mydf %>% mutate_if(is.factor, as.numeric)

#  A B C D
# 1 1 1 3 1
# 2 2 2 2 2
# 3 3 3 4 3
# 4 4 4 1 4
# 5 5 5 5 5
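
In dplyr 1.0.0 and later, mutate_if() is superseded by across(). An equivalent sketch, assuming a recent dplyr is installed (the names mydf and converted are only for illustration):

```r
library(dplyr)

mydf <- data.frame(A = letters[1:5],
                   C = month.abb[1:5],
                   D = 1:5,
                   stringsAsFactors = TRUE)  # force factor columns in R >= 4.0.0

# Apply as.integer to every factor column; other columns are left unchanged
converted <- mydf %>% mutate(across(where(is.factor), as.integer))
str(converted)
```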

as.numeric does the job too.

df <- iris
df$newgroup <- factor(rep(letters[1:10], length.out = nrow(df))) # just another factor
str(df) # Species and newgroup are categorical variables

as.numeric(df$Species) # this returns the integer codes of Species.
                       # Now, we want to apply this automatically to all
                       # categorical variables

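# using lapply (on a copy, so that df keeps its factors for the next examples)
i <- sapply(df, is.factor)
df1 <- df
df1[i] <- lapply(df1[i], as.numeric)
str(df1)

# using dplyr
library(dplyr)
df2 <- df %>% mutate_if(is.factor, as.numeric)
str(df2)

# using purrr (map_if returns a list, so convert it back to a data frame)
library(purrr)
df3 <- df %>% map_if(is.factor, as.numeric) %>% as.data.frame()
str(df3)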

If you also want to create dummy variables, try

library(dummies)
df.4 <- dummy.data.frame(df, sep = ".")
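
If the dummies package is not available, base R's model.matrix() can build indicator columns as well. A sketch on the iris data (df5 and dummy_cols are illustrative names), where removing the intercept with - 1 gives one column per level:

```r
# One 0/1 indicator column per level of Species; "- 1" drops the intercept
dummy_cols <- model.matrix(~ Species - 1, data = iris)

# Bind the indicators to the numeric columns of the original data
df5 <- cbind(iris[sapply(iris, is.numeric)], dummy_cols)
str(df5)
```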

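Just to add to the answers already posted: this link provides an example of how to convert categorical data to numeric, and, if you are not happy with the default conversion, of how to map those numbers to values you specify.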
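
Mapping levels to values of your own choosing can be sketched with a named lookup vector. The level labels and scores below are hypothetical, chosen only for illustration:

```r
# Hypothetical factor with three levels
interest <- factor(c("Beaucoup", "Peu", "Beaucoup", "Pas du tout"))

# Hypothetical mapping from level label to the numeric value you want
score <- c("Pas du tout" = 0, "Peu" = 1, "Beaucoup" = 5)

# Index the lookup vector by label; unname() drops the labels from the result
interest_num <- unname(score[as.character(interest)])
print(interest_num)
# [1] 5 1 5 0
```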

A short way to do this is the code below:

DataFrameYouWant <- data.frame(yourData)
DataFrameYouWant[] <- lapply(DataFrameYouWant, as.integer)

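The code above turns your data into a data frame and then converts every column to integer, so factors are replaced by their internal codes. Be aware that it also truncates non-integer numeric columns and turns non-numeric character columns into NA. To convert only some columns, apply lapply() to just that subset of columns.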
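
Restricting the conversion to chosen columns might look like this. The data frame dat and the column names in cols are made up for the example:

```r
# Hypothetical data: two factors and one non-integer numeric column
dat <- data.frame(
  grp = factor(c("b", "a", "b")),
  lvl = factor(c("low", "high", "low")),
  w   = c(0.25, 0.5, 0.75)
)

# Convert only the named columns; w keeps its decimal values
cols <- c("grp", "lvl")
dat[cols] <- lapply(dat[cols], as.integer)
str(dat)
```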

This can also be done in a single step with the factor() function:

M$colname = factor(M$colname, levels = c(level1,level2,...), labels = c(label1, label2,...))

Note: this replaces the existing column. The result is still a factor, now carrying the labels you supplied.
