
How to select columns with equal or more than 2 unique values while ignoring NA and blank?

My dataframe looks similar to this:

df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 color = c(NA, "black", "black", NA, "brown"),
                 animal = c("dog", "", "", "", ""),
                 owner = c("YES", "NO", "NO", "YES", NA))
ID  color  animal  owner
1   NA     dog     YES
2   black          NO
3   black          NO
4   NA             YES
5   brown          NA

I would like to retrieve the column names of all columns with 2 or more unique values, ignoring NA and blank/empty strings in this calculation.

My solution so far:

df_col <- df %>% 
        select_if(function(col) length(unique(na.omit(col)))>1)

df_col <- colnames(df_col)

But I have noticed that na.omit() won't help, since it deletes the whole row.

Any help would be appreciated. Thank you in advance!

Use n_distinct, which also has an na.rm argument. The _if/_at/_all variants are superseded in favor of across/where. Empty strings ('') can be checked with nzchar, which returns TRUE only for non-empty strings. So subset the elements of each column with nzchar, apply n_distinct column-wise to build the condition that selects only those columns, and then get the names.

library(dplyr)
df %>%
  select(where(~ n_distinct(.x[nzchar(.x)], na.rm = TRUE) > 1)) %>%
  names()

Output:

[1] "ID"    "color" "owner"
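
To see why animal drops out, it can help to look at the per-column counts behind that condition. The following is a minimal base R sketch of the same filtering idea, not the answer's own code; the helper name count_nonblank_distinct is made up for illustration, and stringsAsFactors = FALSE is set explicitly on the assumption that the columns should stay character vectors on older R versions too.

```r
# Sample data from the question
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 color = c(NA, "black", "black", NA, "brown"),
                 animal = c("dog", "", "", "", ""),
                 owner = c("YES", "NO", "NO", "YES", NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: number of distinct values in a vector,
# skipping both NA and empty strings
count_nonblank_distinct <- function(x) {
  keep <- !is.na(x) & nzchar(x)
  length(unique(x[keep]))
}

# One count per column, as a named integer vector
counts <- vapply(df, count_nonblank_distinct, integer(1))

# Names of the columns with two or more distinct non-blank values
selected <- names(counts)[counts >= 2]
print(counts)
print(selected)
```

Here animal has only one non-blank value ("dog"), so it is the only column that fails the threshold.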

Another option is to convert the "" to NA (na_if), which may be slightly more compact:

df %>% 
  select(where(~ n_distinct(na_if(.x, ""), na.rm = TRUE) > 1)) %>% 
  names
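
One caveat, as far as I am aware: more recent dplyr releases (1.1.0 onward, I believe) make na_if() stricter about types, so calling it with "" on a numeric column such as ID may raise an error. A possible sketch that only applies na_if() to character columns is shown below; the is.character() guard is my addition, not part of the original answer.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 color = c(NA, "black", "black", NA, "brown"),
                 animal = c("dog", "", "", "", ""),
                 owner = c("YES", "NO", "NO", "YES", NA),
                 stringsAsFactors = FALSE)

# Only character columns go through na_if(); other types are counted as-is
cols <- df %>%
  select(where(~ n_distinct(if (is.character(.x)) na_if(.x, "") else .x,
                            na.rm = TRUE) > 1)) %>%
  names()

print(cols)
```

This should behave the same on older dplyr versions, since the guard only skips a call that had no effect on non-character columns anyway.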

You can replace "" with NA (na_if), and then use lengths to count the number of unique values. Use names and which to get the vector of names that have two or more values.

names(which(lengths(lapply(na_if(df, ""), \(x) unique(x[!is.na(x)]))) >= 2))
[1] "ID"    "color" "owner"

Combining with n_distinct:

colnames(df)[lapply(na_if(df, ""), n_distinct, na.rm = TRUE) >= 2]
[1] "ID"    "color" "owner"
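
Both one-liners pass the whole data frame to na_if(), which, to my knowledge, newer dplyr versions (1.1.0 and later) no longer accept. If that applies to your setup, the same blank-to-NA idea can be sketched in base R without dplyr; blank_to_na is a hypothetical helper name used only for this example.

```r
# Sample data from the question
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 color = c(NA, "black", "black", NA, "brown"),
                 animal = c("dog", "", "", "", ""),
                 owner = c("YES", "NO", "NO", "YES", NA),
                 stringsAsFactors = FALSE)

# Hypothetical helper: turn "" into NA, for character vectors only
blank_to_na <- function(x) {
  if (is.character(x)) x[x %in% ""] <- NA
  x
}

# Apply column by column, keeping the data frame structure
df_na <- df
df_na[] <- lapply(df, blank_to_na)

# Count distinct non-NA values per column and keep names with two or more
n_vals <- vapply(df_na, function(x) length(unique(x[!is.na(x)])), integer(1))
result <- names(n_vals)[n_vals >= 2]
print(result)
```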
