How do I automatically convert columns to factors if all of their observations are 0 or 1?

I have a very large dataset where some of the variables are currently integers or doubles but should be factors. The observations in these columns are all 0, 1, or NA. How do I convert all of these columns to factors with dplyr?

The canonical dplyr way is to write a custom predicate function that returns TRUE or FALSE for each column, depending on whether the condition is met, and to use this function inside across(where(predicate_function), ...).

Below I borrow the example data from @Tob and add some variations: one column of 0s and 1s stored as double, one column that contains an NA, one integer column of 0s and 1s, and one numeric column that contains other values.

library(dplyr)

test_data <- tibble(strings = c("a", "b", "c", "d", "e"), 
                    col_2 = c(1, 0, 0, 0, NA), 
                    col_3 = as.double(c(0, 1, 1, 0, 1)),
                    col_4 = c(0L, 1L, 1L, 0L, 1L),
                    col_5 = 1:5)

# let's have a look at the data and the column types
test_data

#> # A tibble: 5 x 5
#>   strings col_2 col_3 col_4 col_5
#>   <chr>   <dbl> <dbl> <int> <int>
#> 1 a           1     0     0     1
#> 2 b           0     1     1     2
#> 3 c           0     1     1     3
#> 4 d           0     0     0     4
#> 5 e          NA     1     1     5

# predicate function: TRUE if every value in the column is 0, 1, or NA
is_01_col <- function(x) {
  all(unique(x) %in% c(0, 1, NA))
}

test_data %>% 
  mutate(across(where(is_01_col), as.factor)) %>%
  glimpse
#> Rows: 5
#> Columns: 5
#> $ strings <chr> "a", "b", "c", "d", "e"
#> $ col_2   <fct> 1, 0, 0, 0, NA
#> $ col_3   <fct> 0, 1, 1, 0, 1
#> $ col_4   <fct> 0, 1, 1, 0, 1
#> $ col_5   <int> 1, 2, 3, 4, 5

Created on 2021-07-26 by the reprex package (v0.3.0)
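
One caveat with the predicate above: because %in% coerces both sides to a common type, it also returns TRUE for logical columns (TRUE matches 1), for character columns that hold only "0" and "1", and for columns that are entirely NA. If that is not what you want, a stricter variant might look like the sketch below. The name is_strict_01_col and the example columns are my own assumptions, not part of the original answer.

```r
library(dplyr)

# Hypothetical stricter predicate: numeric columns only,
# at least one non-missing value, and nothing other than 0 or 1
is_strict_01_col <- function(x) {
  if (!is.numeric(x)) return(FALSE)   # excludes logical, character, factor
  vals <- x[!is.na(x)]
  length(vals) > 0 && all(vals %in% c(0, 1))
}

# Made-up data covering the edge cases
edge_cases <- tibble(
  flag_dbl = c(1, 0, NA),
  flag_int = c(0L, 1L, 1L),
  all_na   = c(NA_real_, NA_real_, NA_real_),
  lgl      = c(TRUE, FALSE, NA),
  chr01    = c("0", "1", "0"),
  counts   = 1:3
)

converted <- edge_cases %>%
  mutate(across(where(is_strict_01_col), as.factor))

glimpse(converted)
```

With this version only flag_dbl and flag_int should become factors; the other columns keep their original types.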

This is what I might do, but I don't know how fast it will be if your data is large.

library(dplyr)

# Create some data
test_data <- data.frame(strings = c("a", "b", "c", "d", "e"), 
                col_2 = c(1, 0, 0, 0, 1), 
                col_3 = c( 0,1, 1, 0, 1))


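# Find numeric columns whose non-missing values are exactly 0 and 1
# (as.numeric() lets integer columns match too; sort() drops NAs)
is_binary <- vapply(test_data, function(x) is.numeric(x) && identical(as.numeric(sort(unique(x))), c(0, 1)), logical(1))
cols_to_convert <- names(test_data)[is_binary]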

# Convert these columns to factors 
new_data <- test_data %>% mutate(across(all_of(cols_to_convert),  ~ as.factor(.x)))

# Check that the columns are factors
lapply(new_data, class)
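
If you would rather not depend on dplyr for this step, the same idea can be sketched in base R. This is only an illustration under my own assumptions (the data and the name is_binary are made up, and I have not benchmarked it on large data). It skips the sort() call and, unlike the check above, also accepts columns that contain only one of the two values.

```r
# Hypothetical example data with a double and an integer 0/1 column
test_data <- data.frame(
  strings = c("a", "b", "c", "d", "e"),
  col_2   = c(1, 0, 0, 0, 1),
  col_3   = c(0L, 1L, 1L, 0L, NA)
)

# vapply() guarantees exactly one TRUE/FALSE per column
is_binary <- vapply(
  test_data,
  function(x) is.numeric(x) && all(x %in% c(0, 1, NA)),
  logical(1)
)

# Convert only the flagged columns, leaving the rest untouched
test_data[is_binary] <- lapply(test_data[is_binary], factor)

# Check the resulting column classes
sapply(test_data, class)
```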


Here is another dplyr approach. I used the built-in dataset mtcars because two of its double columns (vs and am) are binary (0 and 1).

library(dplyr)

df <- mtcars %>%
  mutate(across(where( ~ setequal(na.omit(.x), 0:1)), as.factor))

glimpse(df)
# Rows: 32
# Columns: 11
# $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,~
# $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4,~
# $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140~
# $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 18~
# $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92,~
# $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.1~
# $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.~
# $ vs   <fct> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,~
# $ am   <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,~
# $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4,~
# $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1,~
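
Note that setequal() requires both 0 and 1 to be present, so a column that happens to contain only zeros (or only ones) is left alone. If such columns should be converted as well, one option is a subset check combined with explicit factor levels, so every converted column ends up with the same two levels. A minimal sketch with made-up data (the tibble and its column names are assumptions for illustration):

```r
library(dplyr)

# Hypothetical data: one column has both values, one has only zeros
df <- tibble(both = c(0, 1, NA), only_zero = c(0, 0, 0))

setequal(na.omit(df$both), 0:1)       # both values present
setequal(na.omit(df$only_zero), 0:1)  # the value 1 never occurs

# Subset check instead of set equality, with fixed levels
out <- df %>%
  mutate(across(
    where(~ is.numeric(.x) && all(.x %in% c(0, 1, NA))),
    ~ factor(.x, levels = c(0, 1))
  ))

levels(out$only_zero)
```

Fixing the levels up front also keeps later modelling or plotting code from seeing a one-level factor in some columns.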
