
How do I automatically convert columns to factor datatype if all the observations are 0 or 1?

I have a very large dataset where some of the variables are currently integers or doubles but should be factors. Since the observations in these columns are all either 0, 1, or NA, how do I convert all of them to factors in dplyr?

The canonical dplyr way would be to write a custom predicate function that returns TRUE or FALSE for each column depending on whether the conditions are met, and to use this function inside across(where(predicate_function), ...).

Below I borrow the example data from @Tob and add some variations (one column holds 0 and 1 as doubles, one column contains NAs, and one is a numeric column that contains other values).

library(dplyr)

test_data <- tibble(strings = c("a", "b", "c", "d", "e"), 
                    col_2 = c(1, 0, 0, 0, NA), 
                    col_3 = as.double(c(0, 1, 1, 0, 1)),
                    col_4 = c(0L, 1L, 1L, 0L, 1L),
                    col_5 = 1:5)

# let's have a look at the data and the column types
test_data

#> # A tibble: 5 x 5
#>   strings col_2 col_3 col_4 col_5
#>   <chr>   <dbl> <dbl> <int> <int>
#> 1 a           1     0     0     1
#> 2 b           0     1     1     2
#> 3 c           0     1     1     3
#> 4 d           0     0     0     4
#> 5 e          NA     1     1     5

# predicate function
is_01_col <- function(x) {
  all(unique(x) %in% c(0, 1, NA))
}

test_data %>% 
  mutate(across(where(is_01_col), as.factor)) %>%
  glimpse
#> Rows: 5
#> Columns: 5
#> $ strings <chr> "a", "b", "c", "d", "e"
#> $ col_2   <fct> 1, 0, 0, 0, NA
#> $ col_3   <fct> 0, 1, 1, 0, 1
#> $ col_4   <fct> 0, 1, 1, 0, 1
#> $ col_5   <int> 1, 2, 3, 4, 5

Created on 2021-07-26 by the reprex package (v0.3.0)
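One caveat with a predicate built on %in%: the comparison coerces types, so a logical column, a character column holding "0" and "1", or a column that is entirely NA would also pass the test. If that is not what you want, a stricter predicate might look like the sketch below. The name is_binary_num and the edge_cases data are made up for illustration, and fixing the levels with factor(levels = c(0, 1)) is my own assumption about what is useful here, not part of the answer above.

```r
library(dplyr)

# Hypothetical stricter predicate: numeric columns only, with at least one
# non-missing value, where every value is 0, 1, or NA
is_binary_num <- function(x) {
  is.numeric(x) && !all(is.na(x)) && all(x %in% c(0, 1, NA))
}

# Illustrative data covering the edge cases mentioned above
edge_cases <- tibble(
  flag    = c(TRUE, FALSE, NA),             # logical, should stay logical
  chr_01  = c("0", "1", "1"),               # character, should stay character
  all_na  = c(NA_real_, NA_real_, NA_real_),# all missing, should stay numeric
  bin_dbl = c(0, 1, NA),                    # double 0/1 with NA
  bin_int = c(1L, 1L, 0L)                   # integer 0/1
)

# Fixing the levels keeps both "0" and "1" even if only one value occurs
converted <- edge_cases %>%
  mutate(across(where(is_binary_num), ~ factor(.x, levels = c(0, 1))))

sapply(converted, class)
```

Only bin_dbl and bin_int should come back as factors; the other three columns keep their original types.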

This is what I might do, but I don't know how fast it will be if your data is large.

library(dplyr)

# Create some data
test_data <- data.frame(strings = c("a", "b", "c", "d", "e"),
                        col_2 = c(1, 0, 0, 0, 1),
                        col_3 = c(0, 1, 1, 0, 1))

# Find columns whose sorted unique values are exactly 0 and 1 (stored as doubles)
is_01 <- vapply(test_data, function(x) identical(sort(unique(x)), c(0, 1)), logical(1))
cols_to_convert <- names(test_data)[is_01]

# Convert these columns to factors
new_data <- test_data %>% mutate(across(all_of(cols_to_convert), as.factor))

# Check that the columns are factors
lapply(new_data, class)
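Note that identical() also compares storage type, so an integer column holding 0L and 1L would not match c(0, 1) and would be skipped by the check above. If your data mixes integers and doubles, a type-insensitive variant could look like this sketch. The helper name is_only_01 and the sample columns are hypothetical.

```r
library(dplyr)

# Illustrative data with one double and one integer 0/1 column
test_data <- data.frame(strings = c("a", "b", "c"),
                        dbl_col = c(1, 0, 1),
                        int_col = c(0L, 1L, 1L))

# Hypothetical helper: coerce to double before comparing, so integer and
# double columns are treated alike; non-numeric columns are rejected up front
is_only_01 <- function(x) {
  is.numeric(x) && identical(as.numeric(sort(unique(x))), c(0, 1))
}

cols_to_convert <- names(test_data)[vapply(test_data, is_only_01, logical(1))]

new_data <- test_data %>%
  mutate(across(all_of(cols_to_convert), as.factor))

lapply(new_data, class)
```

As in the original check, a column must contain both 0 and 1 to be picked up; sort() drops NAs by default, so missing values do not prevent a match.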


Another dplyr approach to reach your goal. I used the built-in dataset mtcars because some of its columns of type double (vs and am) are binary (0 and 1).

library(dplyr)

df <- mtcars %>% 
  mutate(across(where( ~ setequal(na.omit(.x), 0:1)), as.factor))

glimpse(df)
# Rows: 32
# Columns: 11
# $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2,~
# $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4,~
# $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140~
# $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 18~
# $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92,~
# $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.1~
# $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.~
# $ vs   <fct> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1,~
# $ am   <fct> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,~
# $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4,~
# $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1,~
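If you would rather confirm the result programmatically than read it off the glimpse() output, a quick check along these lines should work. The variable name factor_cols is just for illustration.

```r
library(dplyr)

df <- mtcars %>%
  mutate(across(where(~ setequal(na.omit(.x), 0:1)), as.factor))

# Names of the columns that ended up as factors
factor_cols <- names(df)[sapply(df, is.factor)]
factor_cols

# Levels of one of the converted columns
levels(df$vs)
```

Because setequal() requires both 0 and 1 to be present, a column containing only zeros (or only ones) would be left unconverted by this approach.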

