Convert categorical variables to numeric in R

I have a large data set with many categorical variables. Here is its structure:

> M=data.frame(Type_peau,PEAU_CORPS,SENSIBILITE,IMPERFECTIONS,BRILLANCE ,GRAIN_PEAU,RIDES_VISAGE,ALLERGIES,MAINS,
+              INTERET_ALIM_NATURELLE,INTERET_ORIGINE_GEO,INTERET_VACANCES,INTERET_COMPOSITION,DataQuest1,Priorite2,
+              Priorite1,DataQuest4,Age,Nbre_gift,w,Nbre_achat)
> # check whether there is missing data
> str(M)
'data.frame':   836 obs. of  21 variables:
 $ Type_peau             : Factor w/ 5 levels "","Grasse","Mixte",..: 3 4 5 3 4 3 3 3 2 3 ...
 $ PEAU_CORPS            : Factor w/ 4 levels "","Normale","Sèche",..: 2 3 3 2 2 2 3 2 3 2 ...
 $ SENSIBILITE           : Factor w/ 4 levels "","Aucune","Fréquente",..: 4 4 4 2 4 3 4 2 4 4 ...
 $ IMPERFECTIONS         : Factor w/ 4 levels "","Fréquente",..: 3 4 3 4 3 2 3 4 3 3 ...
 $ BRILLANCE             : Factor w/ 4 levels "","Aucune","Partout",..: 4 2 2 4 4 4 4 4 3 4 ...
 $ GRAIN_PEAU            : Factor w/ 4 levels "","Dilaté","Fin",..: 4 4 4 2 4 2 4 4 2 4 ...
 $ RIDES_VISAGE          : Factor w/ 4 levels "","Aucune","Très visibles",..: 2 2 2 4 4 2 4 2 4 2 ...
 $ ALLERGIES             : Factor w/ 4 levels "","Non","Oui",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ MAINS                 : Factor w/ 4 levels "","Moites","Normales",..: 3 4 4 3 3 3 3 4 4 4 ...
 $ INTERET_ALIM_NATURELLE: Factor w/ 4 levels "","Beaucoup",..: 2 4 4 4 2 2 2 4 4 2 ...
 $ INTERET_ORIGINE_GEO   : Factor w/ 5 levels "","Beaucoup",..: 2 4 2 5 2 2 2 2 2 2 ...
 $ INTERET_VACANCES      : Factor w/ 6 levels "","À la mer",..: 3 4 2 2 3 2 3 2 3 2 ...
 $ INTERET_COMPOSITION   : Factor w/ 4 levels "","Beaucoup",..: 2 2 2 4 2 2 2 2 4 2 ...
 $ DataQuest1            : Factor w/ 4 levels "-20","20-30",..: 4 3 4 4 4 3 3 2 3 2 ...
 $ Priorite2             : Factor w/ 7 levels "éclatante","hydratée",..: 3 1 3 4 3 2 7 1 4 6 ...
 $ Priorite1             : Factor w/ 7 levels "éclatante","hydratée",..: 4 6 1 5 1 6 1 2 6 4 ...
 $ DataQuest4            : Factor w/ 2 levels "nature","urbain": 2 2 2 2 2 1 2 2 2 2 ...
 $ Age                   : int  32 37 23 44 33 30 43 43 60 31 ...
 $ Nbre_gift             : int  1 4 1 1 2 1 1 1 1 1 ...
 $ w                     : num  0.25 0.25 0.5 0.25 0.5 0 0 0 0 0.75 ...
 $ Nbre_achat            : int  3 4 7 3 6 9 22 13 7 16 ...

I need to convert all categorical variables to numeric automatically. For example, the variable Type_peau looks like this:

 head(Type_peau)
[1] Mixte   Normale Sèche   Mixte   Normale Mixte  
Levels:  Grasse Mixte Normale Sèche

I want this instead:

head(Type_peau)
[1] 2 3 4 2 3 2
Levels: 1 2 3 4

How can I do that automatically for all categorical variables?

You can use unclass() to display the numeric values of factor variables:

Type_peau<-as.factor(c("Mixte","Normale","Sèche","Mixte","Normale","Mixte"))
Type_peau
unclass(Type_peau)

To do so on all categorical variables, you can use sapply():

must_convert <- sapply(M, is.factor)                     # logical vector: TRUE for the factor columns
M2 <- sapply(M[, must_convert, drop = FALSE], unclass)   # integer matrix holding the codes of all categorical variables
out <- cbind(M[, !must_convert, drop = FALSE], M2)       # complete data.frame with all variables put together
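
One thing to keep in mind is that cbind() moves the converted columns to the end of the data frame. A minimal sketch that converts the factor columns in place and keeps the original column order could look like this (toy and its columns are made up here as a stand-in for M):

```r
# Toy data standing in for M (hypothetical columns)
toy <- data.frame(
  skin = factor(c("Mixte", "Normale", "Grasse", "Mixte")),
  age  = c(32L, 37L, 23L, 44L),
  hand = factor(c("Normales", "Moites", "Moites", "Normales"))
)

is_fac <- vapply(toy, is.factor, logical(1))    # which columns are factors
toy[is_fac] <- lapply(toy[is_fac], as.integer)  # replace them by their integer codes

str(toy)  # skin and hand are now integer columns, still in their original positions
```

as.integer() is used instead of unclass() here so that the levels attribute is dropped and plain integer columns remain.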

EDIT: A5C1D2H2I1M1N2O1R2T1's solution works in one step:

out<-data.matrix(M)

It only works if your data.frame doesn't contain any character variables, though (otherwise, at least in older versions of R, they will be set to NA).

Maybe you're after data.matrix. From the function's description:

Return the matrix obtained by converting all the variables in a data frame to numeric mode and then binding them together as the columns of a matrix. Factors and ordered factors are replaced by their internal codes.

Example:

# stringsAsFactors = TRUE is needed in R >= 4.0.0 to get factor columns
mydf <- data.frame(A = letters[1:5],
                   B = LETTERS[1:5],
                   C = month.abb[1:5],
                   D = 1:5,
                   stringsAsFactors = TRUE)
str(mydf)
# 'data.frame': 5 obs. of  4 variables:
#  $ A: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
#  $ B: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ C: Factor w/ 5 levels "Apr","Feb","Jan",..: 3 2 4 1 5
#  $ D: int  1 2 3 4 5
data.matrix(mydf)
#      A B C D
# [1,] 1 1 3 1
# [2,] 2 2 2 2
# [3,] 3 3 4 3
# [4,] 4 4 1 4
# [5,] 5 5 5 5

Replace it all at once with:

mydf[] <- data.matrix(mydf)
mydf
#   A B C D
# 1 1 1 3 1
# 2 2 2 2 2
# 3 3 3 4 3
# 4 4 4 1 4
# 5 5 5 5 5
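
Note that the codes follow the order of the factor levels, which is alphabetical by default (hence Apr = 1 and Jan = 3 above). If the numbers should reflect a meaningful order, one option is to set the levels explicitly before converting. A small sketch, where the vector m is made up for illustration:

```r
m <- month.abb[1:5]   # "Jan" "Feb" "Mar" "Apr" "May"

codes_default  <- as.integer(factor(m))                      # alphabetical levels
codes_calendar <- as.integer(factor(m, levels = month.abb))  # calendar order

codes_default   # 3 2 4 1 5
codes_calendar  # 1 2 3 4 5
```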

Of course, if you have more column types, you first have to decide how you want to deal with them. For instance, there's the concern that if there's a character column, data.matrix would result in a column of NA values, which is correct (at least in older versions of R). However, the real question should be: "How would you like to deal with character columns?"

Here are two options. You can extend the logic similarly for other column types.

# stringsAsFactors = TRUE is needed in R >= 4.0.0 to get factor columns
mydf <- data.frame(A = letters[1:5],
                   B = LETTERS[1:5],
                   C = month.abb[1:5],
                   D = 1:5,
                   stringsAsFactors = TRUE)
mydf$E <- state.abb[1:5]
str(mydf)
# 'data.frame': 5 obs. of  5 variables:
#  $ A: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
#  $ B: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ C: Factor w/ 5 levels "Apr","Feb","Jan",..: 3 2 4 1 5
#  $ D: int  1 2 3 4 5
#  $ E: chr  "AL" "AK" "AZ" "AR" ...

## You want to convert everything to numeric
data.matrix(data.frame(unclass(mydf))) 
#      A B C D E
# [1,] 1 1 3 1 2
# [2,] 2 2 2 2 1
# [3,] 3 3 4 3 4
# [4,] 4 4 1 4 3
# [5,] 5 5 5 5 5

## You only want to convert factors to numeric
mydf[sapply(mydf, is.factor)] <- data.matrix(mydf[sapply(mydf, is.factor)])
mydf
#   A B C D  E
# 1 1 1 3 1 AL
# 2 2 2 2 2 AK
# 3 3 3 4 3 AZ
# 4 4 4 1 4 AR
# 5 5 5 5 5 CA
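
As a rough illustration of extending the logic to other column types, the sketch below picks a numeric representation per column. The helper to_numeric and the data frame ex are hypothetical, and the choices made for each type are only one possibility:

```r
# Hypothetical helper: choose a numeric representation per column type
to_numeric <- function(x) {
  if (is.factor(x))    return(as.integer(x))          # internal codes
  if (is.character(x)) return(as.integer(factor(x)))  # codes of alphabetical levels
  if (is.logical(x))   return(as.integer(x))          # FALSE = 0, TRUE = 1
  x                                                   # numeric columns stay as they are
}

ex <- data.frame(f = factor(c("b", "a", "b")),
                 s = c("x", "z", "y"),
                 l = c(TRUE, FALSE, TRUE),
                 n = c(0.5, 1.5, 2.5),
                 stringsAsFactors = FALSE)

ex[] <- lapply(ex, to_numeric)  # ex[] keeps the data.frame structure
ex
```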

library(dplyr)

# stringsAsFactors = TRUE is needed in R >= 4.0.0 to get factor columns
mydf <- data.frame(A = letters[1:5],
                   B = LETTERS[1:5],
                   C = month.abb[1:5],
                   D = 1:5,
                   stringsAsFactors = TRUE)
glimpse(mydf)

# Observations: 5
# Variables: 4
# $ A <fctr> a, b, c, d, e
# $ B <fctr> A, B, C, D, E
# $ C <fctr> Jan, Feb, Mar, Apr, May
# $ D <int> 1, 2, 3, 4, 5

Using predicate functions in dplyr:

mydf %>% mutate_if(is.factor, as.numeric)

#  A B C D
# 1 1 1 3 1
# 2 2 2 2 2
# 3 3 3 4 3
# 4 4 4 1 4
# 5 5 5 5 5

as.numeric does the job too.

df <- iris
df$newgroup <- factor(rep(letters[1:10], length.out = nrow(df))) # just another factor
str(df) # Species and newgroup are categorical variables

as.numeric(df$Species) # returns the integer codes of the Species levels.
                       # Now we want to apply this automatically to all
                       # categorical variables

# using lapply (on a copy, so that df keeps its factors for the examples below)
df1 <- df
i <- sapply(df1, is.factor)
df1[i] <- lapply(df1[i], as.numeric)
str(df1)

# using dplyr
library(dplyr)
df2 <- df %>% mutate_if(is.factor, as.numeric)
str(df2)

# using purrr
library(purrr)
# map_if() returns a list, so convert the result back to a data frame
df3 <- df %>% map_if(is.factor, as.numeric) %>% as.data.frame()
str(df3)
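
A caveat worth adding: as.numeric() returns the internal codes, not the labels. For a factor whose levels are themselves numbers, that is usually not what you want, and going through as.character() first recovers the values. A short sketch with a made-up factor f:

```r
f <- factor(c("10", "20", "10", "5"))

codes  <- as.numeric(f)                # internal codes, not the numbers shown
values <- as.numeric(as.character(f))  # the actual numbers: 10 20 10 5

codes
values
```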

If you also want to create dummy variables, try:

library(dummies)
df.4 <- dummy.data.frame(df, sep = ".")
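
If the dummies package is not available, base R's model.matrix() can produce indicator columns as well. This is an alternative rather than the approach above, shown as a minimal sketch on a made-up data frame d:

```r
d <- data.frame(g = factor(c("a", "b", "c", "a")), x = 1:4)

# "- 1" drops the intercept so that every level of g gets its own 0/1 column
dummies <- model.matrix(~ g - 1, data = d)
dummies

# Bind the indicator columns back to the remaining variables if needed
d_wide <- cbind(d["x"], dummies)
```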

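Just to add to the answers already posted: this link provides an example of how to convert categorical data to numeric, and also of how to map those numbers to specified values if you are not happy with the default conversion.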

A quick way to do this is with the code below:

DataFrameYouWant <- data.frame(yourData)
DataFrameYouWant[] <- lapply(DataFrameYouWant, as.integer)

The code above converts your data to a data frame and then applies as.integer to every column, so all factor variables are replaced by their integer codes. Note that as.integer also truncates non-integer numeric columns and turns non-numeric character columns into NA, so you may prefer to specify which columns/variables you want to convert.
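
To illustrate why restricting the conversion to selected columns can matter, here is a small sketch on a made-up data frame dat with one factor and one decimal column (similar in spirit to the w column in the question):

```r
dat <- data.frame(grp = factor(c("b", "a", "b")),
                  w   = c(0.25, 0.5, 0.75))

# Converting every column: the decimals in w are truncated to 0
all_cols <- dat
all_cols[] <- lapply(all_cols, as.integer)

# Converting only the factor columns: w is left untouched
fac_only <- dat
is_fac <- vapply(fac_only, is.factor, logical(1))
fac_only[is_fac] <- lapply(fac_only[is_fac], as.integer)

all_cols
fac_only
```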

This can also be done in a single step using the factor() function.

M$colname = factor(M$colname, levels = c(level1,level2,...), labels = c(label1, label2,...))

Note: this will replace the column.
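
Keep in mind that the result of factor() is still a factor. If the goal is a numeric column with values of your choosing, one way is to use numeric labels and then convert via as.character(). A sketch with a made-up vector x and an arbitrary mapping:

```r
x <- c("Non", "Oui", "Non", "Oui")

# Relabel the levels with the numbers we want (stored as the labels "0" and "1")
f <- factor(x, levels = c("Non", "Oui"), labels = c(0, 1))

# Convert the labels, not the internal codes, to numeric
num <- as.numeric(as.character(f))
num  # 0 1 0 1
```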
