Convert all data frame character columns to factors

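Given a (pre-existing) data frame that has columns of various types, what is the simplest way to convert all of its character columns to factors without affecting columns of any other type?

Here is an example data.frame: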

df <- data.frame(A = factor(LETTERS[1:5]),
                 B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
                 D = letters[1:5],
                 E = paste(LETTERS[1:5], letters[1:5]),
                 stringsAsFactors = FALSE)
df
#   A B     C D   E
# 1 A 1  TRUE a A a
# 2 B 2  TRUE b B b
# 3 C 3 FALSE c C c
# 4 D 4 FALSE d D d
# 5 E 5  TRUE e E e
str(df)
# 'data.frame':  5 obs. of  5 variables:
#  $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
#  $ B: int  1 2 3 4 5
#  $ C: logi  TRUE TRUE FALSE FALSE TRUE
#  $ D: chr  "a" "b" "c" "d" ...
#  $ E: chr  "A a" "B b" "C c" "D d" ...

I know I can do:

df$D <- as.factor(df$D)
df$E <- as.factor(df$E)

Is there a way to automate this process a bit more?

Roland's answer is great for this specific problem, but I thought I would share a more general approach.

DF <- data.frame(x = letters[1:5], y = 1:5, z = LETTERS[1:5], 
                 stringsAsFactors=FALSE)
str(DF)
# 'data.frame':  5 obs. of  3 variables:
#  $ x: chr  "a" "b" "c" "d" ...
#  $ y: int  1 2 3 4 5
#  $ z: chr  "A" "B" "C" "D" ...

## The conversion
DF[sapply(DF, is.character)] <- lapply(DF[sapply(DF, is.character)], 
                                       as.factor)
str(DF)
# 'data.frame':  5 obs. of  3 variables:
#  $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
#  $ y: int  1 2 3 4 5
#  $ z: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5

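For the conversion, the left-hand side of the assignment (DF[sapply(DF, is.character)]) subsets the columns that are character. On the right-hand side, for that same subset, you use lapply to perform whatever conversion you need. R is smart enough to replace the original columns with the results.

The handy thing about this is that if you want to go the other way, or do a different conversion, you only change what you look for on the left and what you convert to on the right.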
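To make the "other way" concrete, here is a minimal sketch of the same pattern converting factor columns back to character. The data frame DF2 and the helper variable is_fct are hypothetical names introduced for illustration; they are not part of the answer above.

```r
# Hypothetical example data: one factor column, one integer column.
DF2 <- data.frame(x = factor(letters[1:5]), y = 1:5)

# Left side: select the factor columns. vapply() is used instead of sapply()
# so the selector is always a logical vector, even for a zero-column frame.
is_fct <- vapply(DF2, is.factor, logical(1))

# Right side: the conversion to apply to that subset.
DF2[is_fct] <- lapply(DF2[is_fct], as.character)

str(DF2)
```

Storing the selector once also avoids evaluating the same sapply() call on both sides of the assignment.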

DF <- data.frame(x=letters[1:5], y=1:5, stringsAsFactors=FALSE)

str(DF)
#'data.frame':  5 obs. of  2 variables:
# $ x: chr  "a" "b" "c" "d" ...
# $ y: int  1 2 3 4 5

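In R versions before 4.0.0, the (annoying) default of as.data.frame was to convert all character columns to factor columns. Since R 4.0.0 the default is stringsAsFactors = FALSE, so pass stringsAsFactors = TRUE explicitly. You can use that here:

DF <- as.data.frame(unclass(DF), stringsAsFactors = TRUE)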
str(DF)
#'data.frame':  5 obs. of  2 variables:
# $ x: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
# $ y: int  1 2 3 4 5

As @Raf Z commented on this question, dplyr now has mutate_if. Super useful, simple and readable.

> str(df)
'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: chr  "a" "b" "c" "d" ...
 $ E: chr  "A a" "B b" "C c" "D d" ...

> df <- df %>% mutate_if(is.character,as.factor)

> str(df)
'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
 $ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5

Using dplyr:

library(dplyr)

df <- data.frame(A = factor(LETTERS[1:5]),
                 B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
                 D = letters[1:5],
                 E = paste(LETTERS[1:5], letters[1:5]),
                 stringsAsFactors = FALSE)

str(df)

We get:

'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: chr  "a" "b" "c" "d" ...
 $ E: chr  "A a" "B b" "C c" "D d" ...

Now we can convert all chr columns to factors:

df <- df%>%mutate_if(is.character, as.factor)
str(df)

We get:

'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
 $ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5

Here are some other solutions as well.

With base R:

df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)],
                                       as.factor)

With dplyr 1.0.0 or later:

df <- df %>% mutate(across(where(is.character), as.factor))

With the purrr package:

library(purrr)

df <- df %>% modify_if(is.character, as.factor)

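The easiest way is to use the code below. It automates the whole process of converting all variables in a data frame to factors. It worked perfectly well for me. food_cat is the dataset I am using; change it to the one you are working with.

# Converts every column to a factor, including numeric ones.
for (i in seq_len(ncol(food_cat))) {
  food_cat[[i]] <- as.factor(food_cat[[i]])
}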
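Since the original question asks to leave non-character columns alone, a variant of this loop that checks each column's type first may be closer to what is wanted. This is only a sketch: the small food_cat data frame below is a made-up stand-in, not the answerer's actual dataset.

```r
# Hypothetical stand-in for the food_cat dataset.
food_cat <- data.frame(name = c("apple", "bread", "apple"),
                       kcal = c(52, 265, 52),
                       stringsAsFactors = FALSE)

# Loop over column positions; convert only the character columns.
for (i in seq_along(food_cat)) {
  if (is.character(food_cat[[i]])) {
    food_cat[[i]] <- as.factor(food_cat[[i]])
  }
}

str(food_cat)
```

Using [[i]] rather than [, i] extracts the column as a plain vector regardless of whether the object is a base data frame or a tibble.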

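I used to do this with a simple for loop. As in @A5C1D2H2I1M1N2O1R2T1's answer, lapply is a nice solution. But if you convert all of the columns, you need to wrap the result in data.frame, otherwise you end up with a list. The difference in execution time is small.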

 mm2N=mm2New[,10:18]
 str(mm2N)
'data.frame':   35487 obs. of  9 variables:
 $ bb    : int  4 6 2 3 3 2 5 2 1 2 ...
 $ vabb  : int  -3 -3 -2 -2 -3 -1 0 0 3 3 ...
 $ bb55  : int  7 6 3 4 4 4 9 2 5 4 ...
 $ vabb55: int  -3 -1 0 -1 -2 -2 -3 0 -1 3 ...
 $ zr    : num  0 -2 -1 1 -1 -1 -1 1 1 0 ...
 $ z55r  : num  -2 -2 0 1 -2 -2 -2 1 -1 1 ...
 $ fechar: num  0 -1 1 0 1 1 0 0 1 0 ...
 $ varr  : num  3 3 1 1 1 1 4 1 1 3 ...
 $ minmax: int  3 0 4 6 6 6 0 6 6 1 ...

 # For solution
 t1=Sys.time()
 for(i in 1:ncol(mm2N)) mm2N[,i]=as.factor(mm2N[,i])
 Sys.time()-t1
Time difference of 0.2020121 secs
 str(mm2N)
'data.frame':   35487 obs. of  9 variables:
 $ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
 $ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
 $ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
 $ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
 $ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
 $ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
 $ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
 $ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
 $ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...

 #lapply solution
 mm2N=mm2New[,10:18]
 t1=Sys.time()
 mm2N <- lapply(mm2N, as.factor)
 Sys.time()-t1
Time difference of 0.209012 secs
 str(mm2N)
List of 9
 $ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
 $ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
 $ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
 $ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
 $ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
 $ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
 $ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
 $ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
 $ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...

 #data.frame lapply solution
 mm2N=mm2New[,10:18]
 t1=Sys.time()
 mm2N <- data.frame(lapply(mm2N, as.factor))
 Sys.time()-t1
Time difference of 0.2010119 secs
 str(mm2N)
'data.frame':   35487 obs. of  9 variables:
 $ bb    : Factor w/ 6 levels "1","2","3","4",..: 4 6 2 3 3 2 5 2 1 2 ...
 $ vabb  : Factor w/ 7 levels "-3","-2","-1",..: 1 1 2 2 1 3 4 4 7 7 ...
 $ bb55  : Factor w/ 8 levels "2","3","4","5",..: 6 5 2 3 3 3 8 1 4 3 ...
 $ vabb55: Factor w/ 7 levels "-3","-2","-1",..: 1 3 4 3 2 2 1 4 3 7 ...
 $ zr    : Factor w/ 5 levels "-2","-1","0",..: 3 1 2 4 2 2 2 4 4 3 ...
 $ z55r  : Factor w/ 5 levels "-2","-1","0",..: 1 1 3 4 1 1 1 4 2 4 ...
 $ fechar: Factor w/ 3 levels "-1","0","1": 2 1 3 2 3 3 2 2 3 2 ...
 $ varr  : Factor w/ 5 levels "1","2","3","4",..: 3 3 1 1 1 1 4 1 1 3 ...
 $ minmax: Factor w/ 7 levels "0","1","2","3",..: 4 1 5 7 7 7 1 7 7 2 ...
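As an aside, base R offers another way to keep the data frame structure when converting every column: assigning into the empty subset mm[] replaces the columns in place instead of rebinding the name to a list. The sketch below uses a tiny made-up data frame mm rather than the mm2N data from the timings above.

```r
# Hypothetical miniature version of the numeric data above.
mm <- data.frame(bb = c(4L, 6L, 2L), zr = c(0, -2, -1))

# mm[] <- ... replaces the columns but keeps the data.frame container,
# so no data.frame() wrapper is needed around lapply().
mm[] <- lapply(mm, as.factor)

str(mm)
```

This has not been timed against the three variants shown above.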

I noticed that indexing columns with "[" fails to create the levels when iterating:

for (a_feature in convert.to.factors) {
  feature.df[a_feature] <- factor(feature.df[a_feature])
}

For the "Status" column, for example, it creates:

Status : Factor w/ 1 level "c(\"Success\", \"Fail\")": NA NA NA ...

This is remedied by using "[[" indexing:

for (a_feature in convert.to.factors) {
  feature.df[[a_feature]] <- factor(feature.df[[a_feature]])
}

Giving, as desired:

Status : Factor w/ 2 levels "Success","Fail": 1 1 2 1 ...
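The difference comes from what each operator returns: single brackets with a column name return a one-column data frame (a list), while double brackets return the column vector itself, which is what factor() expects. A minimal sketch, using a made-up feature.df with a single Status column:

```r
# Hypothetical data frame with one character column.
feature.df <- data.frame(Status = c("Success", "Success", "Fail"),
                         stringsAsFactors = FALSE)

one_col_df <- feature.df["Status"]    # still a data.frame (a list of columns)
vec        <- feature.df[["Status"]]  # the underlying character vector

class(one_col_df)
class(vec)

# Converting the extracted vector gives one level per distinct value.
feature.df[["Status"]] <- factor(feature.df[["Status"]])
str(feature.df)
```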

Based on @Roland's answer and @Paul de Barros's comment, I arrived at the following:

df <- data.frame(A = factor(LETTERS[1:5]),
                 B = 1:5, C = as.logical(c(1, 1, 0, 0, 1)),
                 D = letters[1:5],
                 E = paste(LETTERS[1:5], letters[1:5]),
                 stringsAsFactors = FALSE)

df <- as.data.frame(unclass(df), stringsAsFactors = TRUE)
str(df)

It is simple and seems to work:

> str(df)
'data.frame':   5 obs. of  5 variables:
 $ A: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ B: int  1 2 3 4 5
 $ C: logi  TRUE TRUE FALSE FALSE TRUE
 $ D: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
 $ E: Factor w/ 5 levels "A a","B b","C c",..: 1 2 3 4 5
