Using is.na in R to get Column Names that Contain NA Values

Given the example data set below:

df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))

names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

print(df)

  varA varB varC varD varE varF
1    1    3    5    7    9   NA
2    2   NA   NA   NA   10   NA

I'd like to be able to use kmeans(...) on data sets without having to manually check for, or delete, variables that contain NA anywhere. I'm asking about kmeans(...) right now, but I'll be using the same process for other things, so a kmeans(...)-specific answer won't fully answer my question.

The manual version of what I'd like is:

kmeans_model <- kmeans(df[, -c(2:4, 6)], 10) 

And the pseudo-code would be:

kmeans_model <- kmeans(df[, -c(colnames(is.na(df)))], 10) 

Also, I don't want to delete the data from df. Thanks in advance.

(Obviously kmeans(...) wouldn't work on this example data set, but I can't recreate the real data set.)

Here are two options without sapply:

kmeans_model <- kmeans(df[, !colSums(is.na(df))], 10) 

Or

kmeans_model <- kmeans(df[, colSums(is.na(df)) == 0], 10) 

Explanation:

colSums(is.na(df)) counts the number of NAs per column, resulting in:

colSums(is.na(df))
#varA varB varC varD varE varF 
#   0    1    1    1    0    2 

And then

colSums(is.na(df)) == 0     # converts to logical TRUE/FALSE
#varA  varB  varC  varD  varE  varF 
#TRUE FALSE FALSE FALSE  TRUE FALSE 

is the same as the following, because ! treats a count of 0 as FALSE and any non-zero count as TRUE before negating:

!colSums(is.na(df))
#varA  varB  varC  varD  varE  varF 
#TRUE FALSE FALSE FALSE  TRUE FALSE 

Both expressions can be used to keep only those columns where the logical value is TRUE. Subsetting with df[, ...] returns a new data frame, so df itself is left unchanged.
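Since the title asks for the column names themselves, here is a minimal sketch that reuses the question's df and the colSums(is.na(df)) idea. The names has_na, na_cols and df_complete are illustrative, not from the original answer.

```r
# Example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# Logical vector: TRUE for every column holding at least one NA
has_na <- colSums(is.na(df)) > 0

# Names of the columns that contain NA
na_cols <- names(df)[has_na]
print(na_cols)

# NA-free copy of the data; df itself is not modified
# drop = FALSE keeps a data frame even if only one column survives
df_complete <- df[, !has_na, drop = FALSE]
print(names(df_complete))
```

Storing the logical vector once makes it easy to reuse the same selection for kmeans(...) or any other function, and drop = FALSE guards against the result collapsing to a plain vector when a single column remains.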

This is the generic approach that I use for listing column names and their NA counts, sorted from most to fewest:

sort(colSums(is.na(df)), decreasing = TRUE)
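If only the columns that actually contain NAs are of interest, the sorted counts can be filtered first. This is a small sketch on the question's df; na_counts and na_counts_sorted are illustrative names.

```r
# Example data from the question
df <- as.data.frame(matrix(c(1, 2, 3, NA, 5, NA,
                             7, NA, 9, 10, NA, NA), nrow = 2, ncol = 6))
names(df) <- c("varA", "varB", "varC", "varD", "varE", "varF")

# NA count per column (named numeric vector)
na_counts <- colSums(is.na(df))

# Keep only columns with at least one NA, largest count first
na_counts_sorted <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_counts_sorted)
```

On this data varF comes first with two NAs, followed by the three columns that have one NA each.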

If you prefer sapply, this snippet computes the same per-column counts and then keeps only the columns with at least one NA:

na_counts <- sapply(df, function(x) sum(is.na(x)))  # NA count per column
na_counts[na_counts > 0]                            # columns that contain NAs
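Because the question's two-row example is too small for kmeans(...), the following sketch runs the whole idea end to end on synthetic data. The data frame toy, its column names and the choice of three centers are assumptions made up for illustration. It uses base R's anyNA() with vapply() as a type-safe alternative to sapply().

```r
set.seed(42)  # reproducible synthetic data (illustrative assumption)

# Hypothetical data: 50 rows, four numeric columns, NAs planted in two of them
toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50), d = rnorm(50))
toy$b[c(3, 17)] <- NA
toy$d[10] <- NA

# vapply() + anyNA(): TRUE for every column with at least one NA
has_na <- vapply(toy, anyNA, logical(1))
na_cols <- names(has_na)[has_na]
print(na_cols)

# Cluster only on the complete columns; toy keeps all of its columns
kmeans_model <- kmeans(toy[, !has_na, drop = FALSE], centers = 3)
print(colnames(kmeans_model$centers))
```

anyNA() only needs to know whether a column has any NA, not how many, so it avoids computing counts that are then discarded. The same has_na vector can be passed to other modelling functions without touching the original data.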
