How to select columns with 2 or more unique values while ignoring NA and blanks?

Question

My dataframe looks similar to this:

df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 color = c(NA, "black", "black", NA, "brown"),
                 animal = c("dog", "", "", "", ""),
                 owner = c("YES", "NO", "NO", "YES", NA))

ID	color	animal	owner
1	NA	dog	YES
2	black		NO
3	black		NO
4	NA		YES
5	brown		NA

I would like to retrieve the names of all columns with 2 or more unique values, ignoring NA and blank/empty strings in this calculation.

My solution so far:

df_col <- df %>% 
        select_if(function(col) length(unique(na.omit(col)))>1)

df_col <- colnames(df_col)

But I have noticed that na.omit() won't help, since it deletes the whole row.

Any help would be appreciated. Thank you in advance!

Answer 1

Use n_distinct, which also has an na.rm argument. The _if/_at/_all variants are superseded in favor of across/where. Empty strings ('') can be checked with nzchar, which returns TRUE only for non-empty strings. So subset the elements of each column with nzchar, apply n_distinct column-wise, use the result as the condition to select only those columns, and then get the names:

library(dplyr)
df %>%
  select(where(~ n_distinct(.x[nzchar(.x)], na.rm = TRUE) > 1)) %>%
  names()

Output:

[1] "ID"    "color" "owner"

Another option is to convert "" to NA with na_if, which may be slightly more compact:

df %>% 
  select(where(~ n_distinct(na_if(.x, ""), na.rm = TRUE) > 1)) %>% 
  names

Answer 2

You can replace "" with NA (na_if), and then use lengths to count the number of unique values. Use names and which to get the vector of column names that have two or more unique values.

names(which(lengths(lapply(na_if(df, ""), \(x) unique(x[!is.na(x)]))) >= 2))
[1] "ID"    "color" "owner"

Combining with n_distinct:

colnames(df)[sapply(na_if(df, ""), n_distinct, na.rm = TRUE) >= 2]
[1] "ID"    "color" "owner"
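For comparison, here is a minimal base R sketch of the same idea that does not rely on na_if at all. As far as I know, dplyr 1.1.0 changed na_if() to cast its second argument to the type of the first, so calling it on a whole data frame or on a numeric column with "" may raise an error on newer versions. The helper name count_real_values is hypothetical (it is not part of any package), and treating every column as character before checking for blanks is an assumption made for this example.

```r
# Same example data as in the question.
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 color = c(NA, "black", "black", NA, "brown"),
                 animal = c("dog", "", "", "", ""),
                 owner = c("YES", "NO", "NO", "YES", NA))

# Hypothetical helper: number of unique values in a column,
# not counting NA or empty strings.
count_real_values <- function(col) {
  vals <- as.character(col)                  # also handles numeric and factor columns
  vals <- vals[!is.na(vals) & nzchar(vals)]  # drop NA and ""
  length(unique(vals))
}

# One count per column, returned as a named integer vector.
n_unique <- vapply(df, count_real_values, integer(1))

# Keep the names of columns with 2 or more remaining unique values.
cols <- names(n_unique)[n_unique >= 2]
cols
```

With the example data this should keep ID, color and owner and drop animal, since "dog" is the only value left in animal once the blanks are removed.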
