Convert numerical variables into factors when the number of levels is lower than a given threshold with dplyr

Question

I want to convert numerical variables into factors when the number of levels is lower than a given threshold with dplyr.

This would be most useful with binary variables coded as numerical '0/1'.

example data:

threshold<-5

data<-data.frame(binary1=rep(c(0,1), 5), binary_2=sample(c(0,1), 10, replace = TRUE), multilevel=sample(c(1:4), 10, replace=TRUE), numerical=1:10)

> data
   binary1 binary_2 multilevel numerical
1        0        1          2         1
2        1        0          3         2
3        0        1          2         3
4        1        0          1         4
5        0        1          2         5
6        1        1          4         6
7        0        1          1         7
8        1        1          3         8
9        0        1          1         9
10       1        0          4        10

sapply(data, class)
   binary1   binary_2 multilevel  numerical 
 "numeric"  "numeric"  "integer"  "integer"

I could easily transform all variables into factors with mutate(), across() and where(), like this:

data<-data%>%mutate(across(where(is.numeric), as.factor))

> sapply(data, class)
   binary1   binary_2 multilevel  numerical 
  "factor"   "factor"   "factor"   "factor"

However, I can't find a way to mutate with multiple conditions, including my threshold argument, inside the where() function. I want this output:

sapply(data, class)
   binary1   binary_2 multilevel  numerical 
 "factor"  "factor"  "factor"  "integer"

Tried the following, but failed:

data%>%mutate(across(where(is.numeric & length(unique(.x))<threshold), as.factor))

error message:

Error: Problem with `mutate()` input `..1`.
x object '.x' not found
ℹ Input `..1` is `across(where(!is.factor & length(unique(.x)) < threshold), as.factor)`.
Run `rlang::last_error()` to see where the error occurred.

Maybe I don't understand across() and where() well enough. Suggestions are welcomed.

Additional question: why does putting a negation operator (!) before is.factor give me an error, when the version without (!) works fine?

data<-data%>%mutate(across(where(!is.factor), as.factor))

Error: Problem with `mutate()` input `..1`.
x invalid argument type
ℹ Input `..1` is `across(where(!is.factor), as.factor)`.
Run `rlang::last_error()` to see where the error occurred.

Answer 1

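Use an anonymous (lambda) function in where(). where() expects a predicate function that it calls on each column; an expression such as is.numeric & length(unique(.x)) < threshold is evaluated immediately, at a point where .x does not exist.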

library(dplyr)

data <- data %>% 
     mutate(across(where(~is.numeric(.) && n_distinct(.) < threshold), factor))

sapply(data, class)

#   binary1   binary_2 multilevel  numerical 
#  "factor"   "factor"   "factor"  "integer"
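
If the same test is needed in several places, the lambda can be pulled out into a named predicate. The following is only a sketch: the helper name is_low_cardinality and its max_levels argument are made up for illustration, and the small data frame is a deterministic stand-in for the example data in the question.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): TRUE for a numeric column
# with fewer than `max_levels` distinct values.
is_low_cardinality <- function(x, max_levels = 5) {
  is.numeric(x) && n_distinct(x) < max_levels
}

# Deterministic stand-in for the question's example data
data <- data.frame(
  binary1    = rep(c(0, 1), 5),
  multilevel = rep(1:4, length.out = 10),
  numerical  = 1:10
)

# where() accepts any function that returns a single TRUE/FALSE per column
result <- data %>%
  mutate(across(where(is_low_cardinality), factor))

sapply(result, class)
#    binary1 multilevel  numerical
#   "factor"   "factor"  "integer"
```

To use a threshold other than the default, wrap the helper in a lambda, for example where(~ is_low_cardinality(.x, threshold)).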

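To answer your additional question: !is.factor is not a function the way is.factor is. The ! operator is applied to the function object itself, which raises the "invalid argument type" error. Wrap the test in a lambda, in the same way as above.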

data %>% mutate(across(where(~!is.factor(.)), factor))
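
As a possible alternative to the lambda, base R's Negate() builds a new function that returns the logical opposite of an existing predicate. A small sketch, where the data frame df and the name not_factor are chosen just for this example:

```r
library(dplyr)

df <- data.frame(a = c(0, 1, 0), b = factor(c("x", "y", "x")))

# `!` works on values, not on functions, so negating the function itself fails
res <- try(!is.factor, silent = TRUE)
inherits(res, "try-error")   # TRUE

# Negate() returns a function, which is what where() needs
not_factor <- Negate(is.factor)

out <- df %>% mutate(across(where(not_factor), factor))
sapply(out, class)
```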

Answer 2

Using data.table

library(data.table)
data1 <- setDT(data)[, lapply(.SD, function(x) 
        if(is.numeric(x) && uniqueN(x) < threshold) factor(x) else x)]
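
Note that setDT() converts data to a data.table by reference, while the lapply(.SD, ...) call above builds a new table. If updating the table in place is preferred, one possible variant selects the qualifying columns first and assigns with :=. This is a sketch on a small stand-in table; the names dt and cols are chosen for this example only.

```r
library(data.table)

threshold <- 5
dt <- data.table(
  binary1    = rep(c(0, 1), 5),
  multilevel = rep(1:4, length.out = 10),
  numerical  = 1:10
)

# Names of numeric columns with fewer than `threshold` distinct values
cols <- names(dt)[sapply(dt, function(x) is.numeric(x) && uniqueN(x) < threshold)]

# Convert only those columns, modifying dt by reference
if (length(cols) > 0) {
  dt[, (cols) := lapply(.SD, factor), .SDcols = cols]
}

sapply(dt, class)
```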

2 answers

solution1
2 ACCEPTED 2021-03-31 03:11:55

solution2
1 2021-03-31 18:13:38
