How to select columns with two or more unique values while ignoring NA and blank strings?
My dataframe looks similar to this:
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 color = c(NA, "black", "black", NA, "brown"),
                 animal = c("dog", "", "", "", ""),
                 owner = c("YES", "NO", "NO", "YES", NA))
ID | color | animal | owner
---|---|---|---
1 | NA | dog | YES
2 | black | | NO
3 | black | | NO
4 | NA | | YES
5 | brown | | NA
I would like to retrieve the names of all columns with at least 2 unique values, ignoring NA and blank/empty strings in the count.
My solution so far:
df_col <- df %>%
select_if(function(col) length(unique(na.omit(col)))>1)
df_col <- colnames(df_col)
But I have noticed that na.omit() won't help, since it deletes the whole row.
Any help would be appreciated. Thank you in advance!
Use n_distinct, which also has an na.rm argument. The _if/_at/_all variants are superseded in favor of across/where. Empty strings ('') can be checked with nzchar, which returns TRUE only if the string is non-empty. So subset the elements of each column with nzchar, apply n_distinct column-wise, use that condition to select only the matching columns, and then get the names:
library(dplyr)
df %>%
select(where(~ n_distinct(.x[nzchar(.x)], na.rm = TRUE) > 1)) %>%
names
Output:
[1] "ID" "color" "owner"
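To see why both the nzchar subset and NA removal are needed, here is a small base R check on two of the question's columns. As far as I can tell, the attempt in the question fails because na.omit() on a plain vector drops only NA and leaves the empty strings in place, so animal is counted as having two values. The helper name count_values is hypothetical and only used for illustration.

```r
# Two columns copied from the question's data frame
animal <- c("dog", "", "", "", "")
color  <- c(NA, "black", "black", NA, "brown")

# na.omit() on a vector removes NA only; "" survives and is counted
n_animal_naive <- length(unique(na.omit(animal)))  # counts "dog" and ""

# nzchar() is FALSE for "" but TRUE for NA by default,
# so NA values still have to be dropped in a separate step
nz <- nzchar(c("dog", "", NA))

# Hypothetical helper: count unique values, ignoring "" and NA
count_values <- function(x) {
  x <- x[nzchar(x)]            # drop empty strings (NA is kept here)
  length(unique(x[!is.na(x)])) # drop NA, then count distinct values
}

count_values(animal)  # only "dog" remains
count_values(color)   # "black" and "brown" remain
```

With the threshold from the answer (more than one value), animal is dropped and color is kept.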
Another option is to convert "" to NA with na_if, which may be slightly more compact:
df %>%
  # as.character() keeps na_if() happy on non-character columns such as ID
  select(where(~ n_distinct(na_if(as.character(.x), ""), na.rm = TRUE) > 1)) %>%
  names
You can replace "" values with NA (na_if), and then use lengths to count the number of unique values. Use names and which to get the vector of names that have two or more values.
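# na_if() is applied per column after as.character(), because newer dplyr versions require the replaced value to match the column type
names(which(lengths(lapply(df, \(x) unique(na.omit(na_if(as.character(x), ""))))) >= 2))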
[1] "ID" "color" "owner"
Combining with n_distinct:
colnames(df)[sapply(df, \(x) n_distinct(na_if(as.character(x), ""), na.rm = TRUE)) >= 2]
[1] "ID" "color" "owner"
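For readers who would rather not depend on dplyr (and on na_if(), which to my knowledge became stricter about matching types in dplyr 1.1.0), the same idea can be sketched in base R. The function name cols_with_min_values and its min_values parameter are hypothetical, not part of any package; the data frame is the one from the question.

```r
# Data frame from the question
df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 color = c(NA, "black", "black", NA, "brown"),
                 animal = c("dog", "", "", "", ""),
                 owner = c("YES", "NO", "NO", "YES", NA))

# Hypothetical helper: names of columns with at least `min_values`
# distinct values, ignoring NA and empty strings
cols_with_min_values <- function(data, min_values = 2) {
  counts <- vapply(data, function(col) {
    col <- as.character(col)                        # treat every column as text
    length(unique(col[!is.na(col) & nzchar(col)]))  # drop NA and "", then count
  }, integer(1))
  names(counts)[counts >= min_values]
}

result <- cols_with_min_values(df)
result
```

Passing a different min_values (for example 3) gives the strict "more than 2" reading of the question instead.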